Aug. 23, 2012, midnight by btjaden

**Topics**:
Graph Algorithms

## Long Repeats

We saw in “Introduction to Pattern Matching” that a data structure commonly used to encode the relationships among a collection of strings was the trie, which is particularly useful when the strings represent a collection of patterns that we wish to match to a larger text.

The trie is helpful when processing multiple strings at once, but when we want to analyze a single string, we need something different.

In this problem, we will use a new data structure to handle the problem of finding long repeats in the genome. Recall from “Finding a Motif in DNA” that cataloguing these repeats is a problem of the utmost interest to molecular biologists, as a natural correlation exists between the frequency of a repeat and its influence on RNA transcription. Our aim is therefore to identify long repeats that occur more than some predetermined number of times.

A repeated substring of a string

The suffix tree of

$T(s)$ is a rooted tree having exactly$n$ leaves.- Every edge of
$T(s)$ is labeled with a substring of$s^*$ , where$s^*$ is the string formed by adding a placeholder symbol`$`

to the end of$s$ . - Every internal node of
$T(s)$ other than the root has at least two children; i.e., it has degree at least 3. - The substring labels for the edges leading from a node to its children must begin with different symbols.
- By concatenating the substrings along edges, each path from the root to a leaf corresponds to a unique suffix of
$s^*$ .

See Figure 1 for an example of a suffix tree.

Given: A DNA string `$`

appended,
a positive integer

- the label of its parent node in
$T(s)$ ; - the label of its child node in
$T(s)$ ; - the location of the substring
$t$ of$s^{*}$ assigned to the edge; and - the length of
$t$ .

Return: The longest substring of

CATACATAC$ 2 node1 node2 1 1 node1 node7 2 1 node1 node14 3 3 node1 node17 10 1 node2 node3 2 4 node2 node6 10 1 node3 node4 6 5 node3 node5 10 1 node7 node8 3 3 node7 node11 5 1 node8 node9 6 5 node8 node10 10 1 node11 node12 6 5 node11 node13 10 1 node14 node15 6 5 node14 node16 10 1

CATAC

## Hint

How can repeated substrings of

$s$ be located in$T(s)$ ?