Genome assembly

Genome assembly is the computational process by which the sequence of nucleotides making up an organism's genome is determined. Researchers are not presently able to read the nucleotides of an entire chromosome one at a time; instead, they rely on chemical sequencing methods that determine the order of bases along short strands of DNA (typically fewer than 1,000 bp). These short strands are called reads.

To sequence a genome, then, researchers chemically "blow up" multiple copies of the genome (typically taken from multiple cells of the same organism) into reads, then determine the sequence of each read. To reassemble the genome, they must use overlaps between reads to determine which pairs of reads are adjacent in the genome. For example, if the short reads "GGACTAG" and "GACTAGAA" are produced, then the end of the first matches the start of the second ("GACTAG"), and we might surmise from this overlap that both reads come from a longer stretch of the genome reading "GGACTAGAA".
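The overlap reasoning in this example can be sketched in a few lines of Python. This is only an illustration of merging two reads by their longest suffix-prefix overlap, not a full assembler: the function names longest_overlap and merge_reads and the min_len threshold are hypothetical choices made here, and the sketch assumes error-free reads taken from the same strand.

```python
from typing import Optional


def longest_overlap(left: str, right: str, min_len: int = 3) -> int:
    """Return the length of the longest suffix of `left` that equals a
    prefix of `right`, or 0 if no overlap of at least `min_len` exists."""
    max_len = min(len(left), len(right))
    # Try the longest candidate overlap first and shrink until one matches.
    for k in range(max_len, max(min_len, 1) - 1, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


def merge_reads(left: str, right: str, min_len: int = 3) -> Optional[str]:
    """Merge two reads along their longest overlap.

    Returns None when the reads do not overlap by at least `min_len` bases.
    Assumes exact matching (no sequencing errors), which real data violates.
    """
    k = longest_overlap(left, right, min_len)
    if k == 0:
        return None
    # Keep all of `left`, then append the part of `right` past the overlap.
    return left + right[k:]


if __name__ == "__main__":
    print(longest_overlap("GGACTAG", "GACTAGAA"))  # 6
    print(merge_reads("GGACTAG", "GACTAGAA"))      # GGACTAGAA
```

The min_len threshold is there because very short overlaps occur by chance between unrelated reads; the value 3 is arbitrary and chosen only to suit the toy example.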

No efficient assembly algorithm yet exists that can handle all of the complications arising in practice (such as errors in reads), so genome assembly remains a highly competitive field of research.