Jan. 11, 2013, 12:04 a.m. by Arthi Ramachandran
An Introduction to High-Throughput Genome Sequencing
In "Genome Assembly as Shortest Superstring", we saw how genomes are reconstructed from a series of shorter reads. Modern sequencing technologies produce these reads by breaking the genome into random fragments and then sequencing those fragments. Multiple rounds of fragmentation and sequencing result in multiple overlapping reads.
Coverage is defined as the average number of reads overlapping a single position. Coverage estimates tell us how much read data supports each position of the genome.
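As a rough illustration of this definition, average coverage is commonly estimated as the total number of sequenced bases divided by the genome length. The sketch below assumes every read has the same length; the function and parameter names are hypothetical and are not part of the problem statement.

```python
def average_coverage(num_reads: int, read_length: int, genome_length: int) -> float:
    """Estimate average coverage as (reads * read length) / genome length.

    Assumes all reads share one length and are spread over the whole genome.
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    total_bases = num_reads * read_length  # total bases sequenced across all reads
    return total_bases / genome_length


if __name__ == "__main__":
    # Illustrative numbers only: 300 million reads of 100 bp on a 3 billion bp genome.
    print(average_coverage(300_000_000, 100, 3_000_000_000))  # 10.0
```

With these illustrative numbers, each position is overlapped by 10 reads on average.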
Consider a genome of size 3 billion base pairs (approximately the size of the human genome). Because the genome is fragmented at random, the number of reads found at a given position is not fixed; it follows a probability distribution.
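The text does not name the distribution. A common modelling assumption, used here purely for illustration, is that the number of reads overlapping a position is Poisson-distributed with mean equal to the average coverage. Under that assumption, a minimal sketch of the probability of seeing exactly k reads at a position might look like this (the function name and arguments are hypothetical):

```python
import math


def poisson_pmf(k: int, mean_coverage: float) -> float:
    """Probability of exactly k reads at a position under a Poisson model.

    ASSUMPTION: read counts per position are Poisson with mean `mean_coverage`.
    The source only says that some probability distribution applies.
    """
    if k < 0 or mean_coverage < 0:
        raise ValueError("k and mean_coverage must be non-negative")
    if mean_coverage == 0:
        # With zero coverage, the only possible count is zero.
        return 1.0 if k == 0 else 0.0
    # Work in log space so large k does not overflow the power or the factorial.
    log_p = -mean_coverage + k * math.log(mean_coverage) - math.lgamma(k + 1)
    return math.exp(log_p)


if __name__ == "__main__":
    # Chance that a position is missed entirely at an average coverage of 10.
    print(poisson_pmf(0, 10.0))
```

Under this model, the probability that a position receives no reads at all is exp(-coverage), which is about 4.5e-5 at 10-fold coverage.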
When considering a length of the genome,
Given: A positive number of short reads
Return: A probability
400 40 7.45