Nov. 6, 2012, 12:43 a.m. by Rayan
Towards practical assembly difficulties
Genome assembly is straightforward if we know in advance that the de Bruijn graph has exactly one directed cycle (see "Genome Assembly with Perfect Coverage").
In practice, genomes contain repeats longer than the
$k$ -mer length, and such repeats increase the number of cycles present in the de Bruijn graph. In these cases, the assembly problem no longer has a single solution.For example, consider the circular string ACCTCCGCC, along with a collection
$S$ of error-free reads of length 3, exhibiting perfect coverage, taken from the same strand. The corresponding de Bruijn graph$B_2$ (where edges correspond to$3$ -mers and nodes to$2$ -mers) has at least two directed cycles: one that gives the original circular sequence ACCTCCGCC, and another corresponding to ACCGCCTCC.Note that these cycles are not simple cycles, as the node CC appears three times in each cycle.
To generalize the problem of genome assembly from a de Bruijn graph, we can add a constraint: in a cycle corresponding to a valid assembly, each
$3$ -mer must appear as many times in the cycle as it does in the reads (which correspond to all$3$ -mers in the original string). With this new constraint, ACCTCCGCC and ACCGCCTCC are the only two valid assemblies in our motivating example.
Recall that a directed cycle is a cycle in a directed graph in which the head of one edge is equal to the tail of the next.
In a de Bruijn graph of
For example, given a cycle AC -> CT -> TA -> AC, the circular sequence of this cycle is ACT.
Given: A list
We do not consider reverse-complements in this problem.
Return: The circular sequences of all cycles, in the de Bruijn graph
Furthermore, to normalize solutions, each circular sequence should start with the first
ACC CTC CCT CCG TCC CGC GCC CCA CAC
ACCTCCGCC ACCGCCTCC