Finding the Longest Multiple Repeat solved by 577

Aug. 23, 2012, midnight by btjaden

Topics: Graph Algorithms

Long Repeats

We saw in “Introduction to Pattern Matching” that a data structure commonly used to encode the relationships among a collection of strings was the trie, which is particularly useful when the strings represent a collection of patterns that we wish to match to a larger text.

The trie is helpful when processing multiple strings at once, but when we want to analyze a single string, we need something different.

In this problem, we will use a new data structure to handle the problem of finding long repeats in the genome. Recall from “Finding a Motif in DNA” that cataloguing these repeats is a problem of the utmost interest to molecular biologists, as a natural correlation exists between the frequency of a repeat and its influence on RNA transcription. Our aim is therefore to identify long repeats that occur more than some predetermined number of times.


Figure 1. The suffix tree for s = GTCCGAAGCTCCGG. Note that the dollar sign has been appended to a substring of the tree to mark the end of s. Every path from the root to a leaf corresponds to a unique suffix of GTCCGAAGCTCCGG, and each leaf is labeled with the location in s of the suffix ending at that leaf.

A repeated substring of a string $s$ of length $n$ is simply a substring that appears in more than one location of $s$; more specifically, a k-fold substring appears in at least k distinct locations.

The suffix tree of $s$, denoted $T(s)$, is defined as follows:

See Figure 1 for an example of a suffix tree.

Given: A DNA string $s$ (of length at most 20 kbp) with $ appended, a positive integer $k$, and a list of edges defining the suffix tree of $s$. Each edge is represented by four components:

  1. the label of its parent node in $T(s)$;
  2. the label of its child node in $T(s)$;
  3. the location of the substring $t$ of $s^{*}$ assigned to the edge; and
  4. the length of $t$.

Return: The longest substring of $s$ that occurs at least $k$ times in $s$. (If multiple solutions exist, you may return any single solution.)

Sample Dataset

node1 node2 1 1
node1 node7 2 1
node1 node14 3 3
node1 node17 10 1
node2 node3 2 4
node2 node6 10 1
node3 node4 6 5
node3 node5 10 1
node7 node8 3 3
node7 node11 5 1
node8 node9 6 5
node8 node10 10 1
node11 node12 6 5
node11 node13 10 1
node14 node15 6 5
node14 node16 10 1

Sample Output



How can repeated substrings of $s$ be located in $T(s)$?

Please login to solve this problem.