Because we use multiple copies of the genome to generate and identify reads for the purposes
of fragment assembly, the total length of all reads will be much longer than the genome itself.
This begs the definition of read coverage as the average number of times that each
nucleotide from the genome appears in the reads. In other words, if the total length
of our reads is 30 billion bp for a 3 billion bp genome, then we have 10x
read coverage.
To handle such a large number of k-mers for the purposes of sequencing the genome, we need
an efficient and simple structure.
Problem
Consider a setS of (k+1)-mers of some unknown DNA string. Let Src
denote the set containing all reverse complements of the elements of S. (recall from “Counting Subsets” that sets
are not allowed to contain duplicate elements).
The de Bruijn graphBk of order k corresponding to S∪Src is a digraph defined in the following way:
Nodes of Bk correspond to all k-mers that are present as a substring of a (k+1)-mer from S∪Src.
Edges of Bk are encoded by the (k+1)-mers of S∪Src in the following way:
for each (k+1)-mer r in S∪Src, form a directed edge (r[1:k], r[2:k+1]).
Given: A collection of up to 1000 (possibly repeating) DNA strings of equal length (not exceeding 50 bp) corresponding to a set S of (k+1)-mers.
Return: The adjacency list corresponding to the de Bruijn graph corresponding to S∪Src.