How Well Assembled Are Our Contigs?
As we have stated, the goal of genome sequencing is to create contigs that are
as long as possible. Thus, after fragment assembly, it is important to
possess statistics quantifying how well-assembled our contigs are.
First and foremost, we demand a measure of what percentage of the assembled genome
is made up of long contigs. Our first question is then: if we select contigs from our collection,
how long do the contigs need to be to cover 50% of the genome?
Problem
Given a collection of DNA strings representing contigs, we use the N
statistic NXX (where XX ranges from 01 to 99) to represent the maximum positive integer L
such that the total number of nucleotides of all contigs
having length ≥L is at least XX% of the sum of contig lengths. The most commonly used
such statistic is N50, although N75 is also worth mentioning.
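Written as a formula, the definition above may be easier to check against the sample. The notation here is an assumption for illustration and not part of the original statement: C denotes the collection of contigs and |c| the length of contig c.

```latex
\[
N_{XX} \;=\; \max\Bigl\{\, L \in \mathbb{Z}_{>0} \;:\;
  \sum_{c \in C,\ |c| \ge L} |c| \;\ge\; \frac{XX}{100} \sum_{c \in C} |c| \,\Bigr\}
\]
```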
Given: A collection of at most 1000 DNA strings
(whose combined length does not exceed 50 kbp).
Return: N50 and N75 for this collection of strings.
Sample Dataset
GATTACA
TACTACTAC
ATTGAT
GAAGA
Sample Output
7 6
Extra Information
To explain the results obtained in the sample above:
contigs of length at least 7 total 7 + 9 = 16 bp, which is more than 50% of the total (27 bp).
Contigs of length at least 8 total only 9 bp (less than 50%).
Contigs of length at least 6 total 6 + 7 + 9 = 22 bp, which is more than 75% of all base pairs.
Contigs of length at least 7 total only 16 bp (less than 75%).
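The reasoning above can be sketched in Python. This is a minimal illustration, not an official solution; the function name `nxx` and its parameters are hypothetical. It relies on the observation that the maximum L is always the length of some contig, so it is enough to walk the contigs from longest to shortest and stop when the running total first reaches the required share.

```python
def nxx(contigs, percent):
    """Return the largest L such that contigs of length >= L
    together hold at least `percent`% of all nucleotides.

    `contigs` is an iterable of DNA strings; `percent` is an integer 1..99.
    (Hypothetical helper, named for illustration only.)
    """
    # Sort contig lengths from longest to shortest.
    lengths = sorted((len(c) for c in contigs), reverse=True)
    total = sum(lengths)

    running = 0
    for length in lengths:
        running += length
        # Integer comparison avoids floating-point rounding issues.
        if running * 100 >= total * percent:
            return length
    return 0  # only reached when the collection is empty


sample = ["GATTACA", "TACTACTAC", "ATTGAT", "GAAGA"]
print(nxx(sample, 50), nxx(sample, 75))  # prints: 7 6
```

Sorting dominates the running time, so this sketch is O(n log n) in the number of contigs, which is comfortable for the stated limit of 1000 strings.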