March 18, 2014, 8:26 p.m. by Rosalind Team
Our goal is to further improve on the memory requirements of the suffix array introduced in “Construct the Suffix Array of a String” for multiple pattern matching.
Given a string Text, form all possible cyclic rotations of Text; a cyclic rotation is defined by chopping off a suffix from the end of Text and appending this suffix to the beginning of Text. Next — similarly to suffix arrays — order all the cyclic rotations of Text lexicographically to form a |Text| × |Text| matrix of symbols that we call the Burrows-Wheeler matrix and denote by M(Text).
Note that the first column of M(Text) contains the symbols of Text ordered lexicographically. In turn, the second column of M(Text) contains the second symbols of all cyclic rotations of Text, and so it too represents a (different) rearrangement of symbols from Text. The same reasoning applies to show that any column of M(Text) is some rearrangement of the symbols of Text. We are interested in the last column of M(Text), called the Burrows-Wheeler transform of Text, or BWT(Text).
Construct the Burrows-Wheeler transform of a string.
Given: A string Text.
Return: BWT(Text).
GCGTGCCTGGTCA$
ACTGGCT$TGCGGC
Note
Although it is possible to construct the Burrows-Wheeler transform in O(|Text|) time and space, we do not expect you to implement such a fast algorithm. In other words, it is perfectly fine to produce BWT(Text) by first producing the complete Burrows-Wheeler matrix M(Text).