# Finding the Longest Multiple Repeat solved by 492

Aug. 23, 2012, midnight by btjaden

Topics: Graph Algorithms

## Long Repeats

We saw in “Introduction to Pattern Matching” that a data structure commonly used to encode the relationships among a collection of strings was the trie, which is particularly useful when the strings represent a collection of patterns that we wish to match to a larger text.

The trie is helpful when processing multiple strings at once, but when we want to analyze a single string, we need something different.

In this problem, we will use a new data structure to handle the problem of finding long repeats in the genome. Recall from “Finding a Motif in DNA” that cataloguing these repeats is a problem of the utmost interest to molecular biologists, as a natural correlation exists between the frequency of a repeat and its influence on RNA transcription. Our aim is therefore to identify long repeats that occur more than some predetermined number of times.

## Problem

Figure 1. The suffix tree for s = GTCCGAAGCTCCGG. Note that the dollar sign has been appended to a substring of the tree to mark the end of s. Every path from the root to a leaf corresponds to a unique suffix of GTCCGAAGCTCCGG, and each leaf is labeled with the location in s of the suffix ending at that leaf.

A repeated substring of a string $s$ of length $n$ is simply a substring that appears in more than one location of $s$; more specifically, a k-fold substring appears in at least k distinct locations.

The suffix tree of $s$, denoted $T(s)$, is defined as follows:

• $T(s)$ is a rooted tree having exactly $n$ leaves.
• Every edge of $T(s)$ is labeled with a substring of $s^*$, where $s^*$ is the string formed by adding a placeholder symbol $ to the end of $s$. • Every internal node of $T(s)$other than the root has at least two children; i.e., it has degree at least 3. • The substring labels for the edges leading from a node to its children must begin with different symbols. • By concatenating the substrings along edges, each path from the root to a leaf corresponds to a unique suffix of $s^*$. See Figure 1 for an example of a suffix tree. Given: A DNA string $s$(of length at most 20 kbp) with $ appended, a positive integer $k$, and a list of edges defining the suffix tree of $s$. Each edge is represented by four components:

1. the label of its parent node in $T(s)$;
2. the label of its child node in $T(s)$;
3. the location of the substring $t$ of $s^{*}$ assigned to the edge; and
4. the length of $t$.

Return: The longest substring of $s$ that occurs at least $k$ times in $s$. (If multiple solutions exist, you may return any single solution.)

CATACATAC$2 node1 node2 1 1 node1 node7 2 1 node1 node14 3 3 node1 node17 10 1 node2 node3 2 4 node2 node6 10 1 node3 node4 6 5 node3 node5 10 1 node7 node8 3 3 node7 node11 5 1 node8 node9 6 5 node8 node10 10 1 node11 node12 6 5 node11 node13 10 1 node14 node15 6 5 node14 node16 10 1  ## Sample Output CATAC  ## Hint How can repeated substrings of $s$be located in $T(s)\$?

Please login to solve this problem.