March 18, 2014, 8:26 p.m. by Rosalind Team
If you implemented BWMATCHING in “Implement BWMatching”, you probably found the algorithm to be slow. The reason for its sluggishness is that updating the pointers top and bottom is time-intensive, since it requires examining every symbol in LastColumn between top and bottom at each step. To improve BWMATCHING, we introduce a function Countsymbol(i, LastColumn), which returns the number of occurrences of symbol in the first i positions of LastColumn. For example, Count"n”(10, "smnpbnnaaaaa$a”) = 3, and Count"a”(4, "smnpbnnaaaaa$a”) = 0.
The green lines from BWMATCHING can be compactly described without the First-to-Last mapping by the following two lines:
Define FirstOccurrence(symbol) as the first position of symbol in FirstColumn. If Text = "panamabananas$", then FirstColumn is "$aaaaaabmnnnps", and the array holding all values of FirstOccurrence is [0, 1, 7, 8, 9, 11, 12]. For DNA strings of any length, the array FirstOccurrence contains only five elements.
The two lines of pseudocode from the previous step can now be rewritten as follows:
In the process of simplifying the green lines of pseudocode from BWMATCHING, we have also eliminated the need for both FirstColumn and LastToFirst, resulting in a more efficient algorithm called BETTERBWMATCHING.
Given: A string BWT(Text), followed by a collection of strings Patterns.
Return: A list of integers, where the i-th integer corresponds to the number of substring matches of the i-th member of Patterns in Text.
GGCGCCGC$TAGTCACACACGCCGTA ACC CCG CAG
1 2 1