Suggested problems

Genome Assembly with Perfect Coverage and Repeats

Nov. 6, 2012, 12:43 a.m. by Rayan

Towards practical assembly difficulties

Genome assembly is straightforward if we know in advance that the de Bruijn graph has exactly one directed cycle (see "Genome Assembly with Perfect Coverage").

In practice, genomes contain repeats longer than the $k$-mer length, and such repeats increase the number of cycles present in the de Bruijn graph. In these cases, the assembly problem no longer has a single solution.

For example, consider the circular string ACCTCCGCC, along with a collection $S$ of error-free reads of length 3, exhibiting perfect coverage, taken from the same strand. The corresponding de Bruijn graph $B_2$ (where edges correspond to $3$-mers and nodes to $2$-mers) has at least two directed cycles: one that gives the original circular sequence ACCTCCGCC, and another corresponding to ACCGCCTCC.

Note that these cycles are not simple cycles, as the node CC appears three times in each cycle.

To generalize the problem of genome assembly from a de Bruijn graph, we can add a constraint: in a cycle corresponding to a valid assembly, each $3$-mer must appear as many times in the cycle as it does in the reads (which correspond to all $3$-mers in the original string). With this new constraint, ACCTCCGCC and ACCGCCTCC are the only two valid assemblies in our motivating example.

Problem

Recall that a directed cycle is a cycle in a directed graph in which the head of one edge is equal to the tail of the next.

In a de Bruijn graph of $k$-mers, the circular string $s$ constructed from a directed cycle $s_1 \rightarrow s_2 \rightarrow ... \rightarrow s_i \rightarrow s_1$ is given by $s_1 + s_2[k] + ... + s_{i-k-2}[k] + s_{i-k+1}[k]$. That is, because the final $k-1$ symbols of $s_1$ overlap with the first $k-1$ symbols of $s_2$, we simply tack on the $k$-th symbol of $s_2$ to $s$, then iterate.

For example, given a cycle AC -> CT -> TA -> AC, the circular sequence of this cycle is ACT.

Given: A list $S_{k+1}$ of error-free DNA $(k+1)$-mers ($k \leq 5$) taken from the same strand of a circular chromosome (of length $\leq 20$).

We do not consider reverse-complements in this problem.

Return: The circular sequences of all cycles, in the de Bruijn graph $B_k$ of $S_{k+1}$, for which the edges are permutations of $S_{k+1}$. In other words, each $(k+1)$-mer edge in the cycle has to appear exactly as many times as in $S_{k+1}$.

Furthermore, to normalize solutions, each circular sequence should start with the first $(k+1)$-mer given in the input.

Sample Dataset

ACC
CTC
CCT
CCG
TCC
CGC
GCC
CCA
CAC

Sample Output

ACCTCCGCC
ACCGCCTCC