Dec. 4, 2012, 7:12 a.m. by Rayan
Topics: Genome Assembly, Graph Algorithms
Repeats: A Practical Assembly Difficulty
Genome assembly is straightforward if we know in advance that the de Bruijn graph has exactly one directed cycle (see “Genome Assembly with Perfect Coverage”).
In practice, a genome contains repeats longer than the length of the k-mers that we wish to use to assemble the genome. Such repeats increase the number of cycles present in the de Bruijn graph for these
$k$-mers, thus preventing us from assembling the genome uniquely.
For example, consider the circular string (ACCTCCGCC), along with a collection
$S$of error-free reads of length 3, exhibiting perfect coverage and taken from the same strand of an interval of DNA. The corresponding de Bruijn graph $B_2$(where edges correspond to 3-mers and nodes correspond to 2-mers) has at least two directed cycles: one giving the original circular string (ACCTCCGCC), and another corresponding to the misfit (ACCGCCTCC).
Also, note that these cycles are not simple cycles, as the node corresponding to "CC" is visited three times in each cycle.
To generalize the problem of genome assembly from a de Bruijn graph to the case of genomes containing repeats, we therefore must add a constraint: in a cycle corresponding to a valid assembly, every 3-mer must appear as many times in the cycle as it does in our collection of reads (which correspond to all 3-mers in the original string).
Recall that a directed cycle is a cycle in a directed graph in which the head of one edge is equal to the tail of the following edge.
In a de Bruijn graph of
For example, the circular string assembled from the cycle "AC"
Given: A list
Return: All circular strings assembled by complete cycles in the de Bruijn graph
CAG AGT GTT TTT TTG TGG GGC GCG CGT GTT TTC TCA CAA AAT ATT TTC TCA
CAGTTCAATTTGGCGTT CAGTTCAATTGGCGTTT CAGTTTCAATTGGCGTT CAGTTTGGCGTTCAATT CAGTTGGCGTTCAATTT CAGTTGGCGTTTCAATT