Processing math: 100%

Genome Assembly with Perfect Coverage and Repeats solved by 285

Dec. 4, 2012, 7:12 a.m. by Rayan

Topics: Genome Assembly, Graph Algorithms

Repeats: A Practical Assembly Difficultyclick to expand

Genome assembly is straightforward if we know in advance that the de Bruijn graph has exactly one directed cycle (see “Genome Assembly with Perfect Coverage”).

In practice, a genome contains repeats longer than the length of the k-mers that we wish to use to assemble the genome. Such repeats increase the number of cycles present in the de Bruijn graph for these k-mers, thus preventing us from assembling the genome uniquely.

For example, consider the circular string (ACCTCCGCC), along with a collection S of error-free reads of length 3, exhibiting perfect coverage and taken from the same strand of an interval of DNA. The corresponding de Bruijn graph B2 (where edges correspond to 3-mers and nodes correspond to 2-mers) has at least two directed cycles: one giving the original circular string (ACCTCCGCC), and another corresponding to the misfit (ACCGCCTCC).

Also, note that these cycles are not simple cycles, as the node corresponding to "CC" is visited three times in each cycle.

To generalize the problem of genome assembly from a de Bruijn graph to the case of genomes containing repeats, we therefore must add a constraint: in a cycle corresponding to a valid assembly, every 3-mer must appear as many times in the cycle as it does in our collection of reads (which correspond to all 3-mers in the original string).

Problem

Recall that a directed cycle is a cycle in a directed graph in which the head of one edge is equal to the tail of the following edge.

In a de Bruijn graph of k-mers, a circular string s is constructed from a directed cycle s1s2...sis1 is given by s1+s2[k]+...+sik[k]+sik+1[k]. That is, because the final k1 symbols of s1 overlap with the first k1 symbols of s2, we simply tack on the k-th symbol of s2 to s, then iterate the process.

For example, the circular string assembled from the cycle "AC" "CT" "TA" "AC" is simply (ACT). Note that this string only has length three because the 2-mers "wrap around" in the string.

If every k-mer in a collection of reads occurs as an edge in a de Bruijn graph cycle the same number of times as it appears in the reads, then we say that the cycle is "complete."

Given: A list Sk+1 of error-free DNA (k+1)-mers (k5) taken from the same strand of a circular chromosome (of length 50).

Return: All circular strings assembled by complete cycles in the de Bruijn graph Bk of Sk+1. The strings may be given in any order, but each one should begin with the first (k+1)-mer provided in the input.

Sample Dataset

CAG
AGT
GTT
TTT
TTG
TGG
GGC
GCG
CGT
GTT
TTC
TCA
CAA
AAT
ATT
TTC
TCA

Sample Output

CAGTTCAATTTGGCGTT
CAGTTCAATTGGCGTTT
CAGTTTCAATTGGCGTT
CAGTTTGGCGTTCAATT
CAGTTGGCGTTCAATTT
CAGTTGGCGTTTCAATT

Please login to solve this problem.