March 18, 2014, 8:26 p.m. by Rosalind Team
We hope that you have noticed what is still a limitation of BETTERBWMATCHING from “Implement BetterBWMatching” — even though this algorithm counts the number of occurrences of Pattern in Text, it does not tell us where these occurrences are located in Text! To locate pattern matches identified by BETTERBWMATCHING, we can once again use the suffix array.
However, recall that our original motivation for using the Burrows-Wheeler transform was to reduce the amount of memory used by the suffix array for pattern matching. If we add the suffix array to Burrows-Wheeler-based pattern matching, then we are right back where we started!
The memory-saving device that we will employ is inelegant but useful. We will build a partial suffix array of Text, denoted SuffixArrayK(Text), which only contains values that are multiples of some positive integer K (see “Construct the Partial Suffix Array of a String”). In real applications, partial suffix arrays are often constructed for K = 100, thus reducing memory usage by a factor of 100 compared to a full suffix array.
We should also discuss how to improve BETTERBWMATCHING (reproduced below) by resolving the trade-off between precomputing the values of Countsymbol(i, LastColumn) (requiring substantial memory) and computing these values as we go (requiring substantial runtime).
The balance that we strike is similar to the one used for the partial suffix array. Rather than storing Countsymbol(i, LastColumn) for all positions i, we will only store the Count arrays when i is divisible by C, where C is a constant; these arrays are called checkpoint arrays. When C is large (C is typically equal to 100 in practice) and the alphabet is small (e.g., 4 nucleotides), checkpoint arrays require only a fraction of the memory used by BWT(Text).
What about runtime? Using checkpoint arrays, we can compute the top and bottom pointers in a constant number of steps (i.e., fewer than C). Since each Pattern requires at most |Pattern| pointer updates, the modified BETTERBWMATCHING algorithm now requires O(|Patterns|) runtime, which is the same as using a trie or suffix array.
Furthermore, we now only have to store the following data in memory: BWT(Text), FirstOccurrence, the partial suffix array, and the checkpoint arrays. Storing this data requires memory approximately equal to 1.5 · |Text|. Thus, we have finally knocked down the memory required for solving the Multiple Pattern Matching Problem for millions of sequencing reads into a workable range, and we can at last solve a full version of the Multiple Pattern Matching Problem.
Find all occurrences of a collection of patterns in a text.
Given: A string Text and a collection of strings Patterns.
Return: All starting positions in Text where a string from Patterns appears as a substring.
AATCGGGTTCAATCGGGGT ATCG GGGT
1 4 11 15