Assessing Assembly Quality with N50 and N75

July 2, 2012, midnight by Mikhail Dvorkin

Topics: Genome Assembly

How Well Assembled Are Our Contigs?

As we have stated, the goal of genome sequencing is to create contigs that are as long as possible. Thus, after fragment assembly, it is important to possess statistics quantifying how well-assembled our contigs are.

First and foremost, we need a measure of what percentage of the assembled genome is made up of long contigs. Our first question is then: if we select only the longest contigs from our collection, how long must the shortest selected contig be for the selection to cover 50% of the genome?

Problem

Given a collection of DNA strings representing contigs, we use the N statistic NXX (where XX ranges from 01 to 99) to represent the maximum positive integer $L$ such that the total number of nucleotides of all contigs having length $\ge L$ is at least XX% of the sum of contig lengths. The most commonly used such statistic is N50, although N75 is also worth mentioning.
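The definition above can be sketched in Python as follows. This is a minimal sketch rather than a reference solution: the function name `n_statistic` and its parameters are hypothetical, and it assumes the collection is non-empty. It relies on the observation that, after sorting contig lengths from longest to shortest, NXX is the length of the contig at which the running total first reaches XX% of the sum of contig lengths.

```python
def n_statistic(contigs, percent):
    """Return NXX for a collection of contig strings (hypothetical helper).

    `percent` is the XX in NXX, e.g. 50 for N50 or 75 for N75.
    """
    lengths = sorted((len(c) for c in contigs), reverse=True)
    total = sum(lengths)
    if total == 0:
        raise ValueError("need at least one non-empty contig")

    covered = 0
    for length in lengths:
        covered += length
        # Integer comparison avoids floating-point rounding:
        # covered / total >= percent / 100
        if covered * 100 >= percent * total:
            return length
    # Unreachable for 1 <= percent <= 99, kept as a safeguard.
    return lengths[-1]


if __name__ == "__main__":
    sample = ["GATTACA", "TACTACTAC", "ATTGAT", "GAAGA"]
    print(n_statistic(sample, 50), n_statistic(sample, 75))  # prints: 7 6
```

Sorting dominates the cost, so this runs in O(n log n) time for n contigs, which is comfortably fast for the stated limit of 1000 strings.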

Given: A collection of at most 1000 DNA strings (whose combined length does not exceed 50 kbp).

Return: N50 and N75 for this collection of strings.

Sample Dataset

GATTACA
TACTACTAC
ATTGAT
GAAGA

Sample Output

7 6

Extra Information

To explain the sample output: contigs of length at least 7 total 7 + 9 = 16 bp, which is more than 50% of the total (27 bp). Contigs of length at least 8 total only 9 bp (less than 50%), so N50 = 7.

Contigs of length at least 6 total 6 + 7 + 9 = 22 bp, which is more than 75% of the 27 bp total. Contigs of length at least 7 total only 16 bp (less than 75%), so N75 = 6.
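The arithmetic in this explanation can be reproduced with a brute-force check that follows the definition literally, trying every candidate length $L$ in turn. This is only an illustrative sketch: the helper names `bases_in_contigs_at_least` and `n_statistic_by_definition` are hypothetical, and the lengths 7, 9, 6, 5 are those of the four sample contigs.

```python
def bases_in_contigs_at_least(lengths, min_len):
    """Total nucleotides in contigs whose length is >= min_len."""
    return sum(n for n in lengths if n >= min_len)


def n_statistic_by_definition(lengths, percent):
    """Largest L whose contigs of length >= L hold at least percent% of all bases."""
    total = sum(lengths)
    best = 0
    for candidate in range(1, max(lengths) + 1):
        if bases_in_contigs_at_least(lengths, candidate) * 100 >= percent * total:
            best = candidate  # the last candidate to pass is the maximum
    return best


if __name__ == "__main__":
    sample_lengths = [7, 9, 6, 5]  # GATTACA, TACTACTAC, ATTGAT, GAAGA
    for min_len in (6, 7, 8):
        print(min_len, bases_in_contigs_at_least(sample_lengths, min_len))
    # prints:
    # 6 22
    # 7 16
    # 8 9
    print(n_statistic_by_definition(sample_lengths, 50),
          n_statistic_by_definition(sample_lengths, 75))  # prints: 7 6
```

This version scans every length up to the longest contig, so it is slower than a sort-based approach, but it is a convenient way to cross-check results on small inputs.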
