Aug. 14, 2012, midnight by Mikhail Dvorkin
Topics: Genome Assembly, Graph Algorithms
Recall that although chromosomes taken from eukaryotes have a linear structure, many bacterial chromosomes are actually circular. We represented a linear chromosome with a DNA string, so we only need to modify the definition of string to model circular chromosomes.
Perfect coverage is the phenomenon in fragment assembly of having a read (or
$k$-mer) begin at every possible location in the genome. Unfortunately, perfect coverage is still difficult to achieve, but fragment assembly technology continues to improve by leaps and bounds, and perfect coverage is perhaps not the fantasy it once was.
A circular string is a string that does not have an initial or terminal element; instead, the string is viewed as a necklace of symbols. We can represent a circular string as a string enclosed in parentheses. For example, consider the circular DNA string (ACGTAC), and note that because the string "wraps around" at the end, this circular string can equally be represented by (CGTACA), (GTACAC), (TACACG), (ACACGT), and (CACGTA). The definitions of substrings and superstrings are easy to generalize to the case of circular strings (keeping in mind that substrings are allowed to wrap around).
Given: A collection of (error-free) DNA
Return: A cyclic superstring of minimal length containing the reads (thus corresponding to a candidate cyclic chromosome).
ATTAC TACAG GATTA ACAGA CAGAT TTACA AGATT
The assumption made above that all reads derive from the same strand is practically unrealistic; in reality, researchers will not know the strand of DNA from which a given read has been sequenced.