July 2, 2012, midnight by Mikhail Dvorkin
Topics: Genome Assembly
How Well Assembled Are Our Contigs?
As we have stated, the goal of genome sequencing is to create contigs that are as long as possible. Thus, after fragment assembly, it is important to possess statistics quantifying how well-assembled our contigs are.
First and foremost, we want a measure of what percentage of the assembled genome is made up of long contigs. Our first question is then: if we select contigs from our collection, how long do the contigs need to be to cover 50% of the genome?
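Given a collection of DNA strings representing contigs, we use the N statistic NXX (where XX ranges from 01 to 99) to represent the maximum positive integer L such that the contigs of length at least L together contain at least XX% of the total number of nucleotides in the collection.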
Given: A collection of at most 1000 DNA strings (whose combined length does not exceed 50 kbp).
Return: N50 and N75 for this collection of strings.
GATTACA TACTACTAC ATTGAT GAAGA
To explain the results for the sample above: contigs of length at least 7 total 7 + 9 = 16 bp, which is more than 50% of the total (27 bp). Contigs of length at least 8 total only 9 bp (less than 50%), so N50 = 7.
Contigs of length at least 6 total 6 + 7 + 9 = 22 bp, which is more than 75% of all 27 bp. Contigs of length at least 7 total only 16 bp (less than 75%), so N75 = 6.
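The reasoning above can be sketched in code. The following is one possible approach, not a reference solution: it sorts contig lengths from longest to shortest and accumulates them until the running total reaches the requested percentage of all bases. The function name `n_statistic` and its parameters are illustrative assumptions, not part of the problem statement.

```python
def n_statistic(contigs, percent):
    """Return the largest L such that contigs of length >= L
    together hold at least `percent`% of all bases.

    `n_statistic` is a hypothetical helper name chosen for this sketch.
    """
    # Longest contigs first, so the running total grows as L decreases.
    lengths = sorted((len(c) for c in contigs), reverse=True)
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * percent:
            return length
    raise ValueError("need a non-empty collection and a percent of at most 100")


contigs = "GATTACA TACTACTAC ATTGAT GAAGA".split()
print(n_statistic(contigs, 50), n_statistic(contigs, 75))  # prints: 7 6
```

Sorting once and scanning keeps the work at O(n log n) for n contigs, which is more than enough for the stated limit of 1000 strings.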