Coverage in Genome Sequencing

Jan. 11, 2013, 12:04 a.m. by Arthi Ramachandran

An Introduction to High-Throughput Genome Sequencing

In "Genome Assembly as Shortest Superstring", we saw how genomes are reconstructed from a series of shorter reads. Modern sequencing technologies produce these reads by breaking the genome into random fragments and then sequencing those fragments. Multiple rounds of fragmentation and sequencing result in many overlapping reads.

Coverage is defined as the average number of reads overlapping a single position. Coverage estimates tell us how much sequencing data we have for each position of the genome.
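To make the definition concrete, here is a minimal sketch that estimates average coverage as total sequenced bases divided by genome size. The formula ignores edge effects at the ends of the genome, the function name `average_coverage` is hypothetical, and the numbers are the sample values used later in this problem (400 million reads of length 40 on a 3 billion base pair genome).

```python
GENOME_SIZE_BP = 3_000_000_000  # genome size used in the problem statement


def average_coverage(num_reads, read_length, genome_size=GENOME_SIZE_BP):
    """Average number of reads overlapping a position.

    Assumes every read has the same length and ignores edge effects.
    """
    return num_reads * read_length / genome_size


coverage = average_coverage(400e6, 40)
print(f"{coverage:.3f}")
```

With these inputs each position is covered by a little over five reads on average.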

...

Problem

Consider a genome of size 3 billion base pairs (the size of the human genome). The genome is randomly fragmented, so the number of reads covering any given position follows a probability distribution.

If $p$ is the probability of no read covering a specific position, then we can define an indicator variable $I$ for that position:

$I = \begin{cases} 1, & \mbox{with probability } p \mbox{ (no read covers the position)}\\ 0, & \mbox{with probability } 1-p \mbox{ (at least one read covers the position)}\end{cases}$

Over a stretch of the genome of length $L$, the number of positions with no reads is $\sum_{L}{I}$, which is binomially distributed.
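As a worked step, assume that positions are covered independently of one another (an assumption not stated in the problem) and write $I_i = 1$ when position $i$ has no read and $I_i = 0$ otherwise. With $L$ counted in base pairs, the number of uncovered positions $X$ then has the standard binomial mean and variance:

$X = \sum_{i=1}^{L} I_i \sim \mathrm{Binomial}(L, p), \qquad E[X] = Lp, \qquad \mathrm{Var}(X) = Lp(1-p)$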

Hint:

  1. Think about how to find the probability of not finding any reads for a single position.
  2. The normal distribution can be used to approximate the binomial distribution.
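The two hints can be sketched in code as follows. This is an illustration under stated assumptions, not a full solution: it assumes each read lands independently and uniformly on the genome, so a single read covers a fixed position with probability $r/G$ (with $G$ the genome size), and it uses a plain normal approximation without continuity correction. The function names are hypothetical.

```python
import math


def prob_position_uncovered(num_reads, read_length, genome_size):
    """Probability that none of num_reads reads covers a fixed position.

    Assumption (not from the problem text): reads are placed independently
    and uniformly, so each covers the position with probability r / G.
    """
    log_miss = math.log1p(-read_length / genome_size)  # log(1 - r/G)
    return math.exp(num_reads * log_miss)              # (1 - r/G)^N


def normal_approx_binomial_cdf(k, n, p):
    """Approximate P(X <= k) for X ~ Binomial(n, p) with a normal
    distribution of mean n*p and variance n*p*(1-p)."""
    mean = n * p
    std = math.sqrt(n * p * (1 - p))
    return 0.5 * (1 + math.erf((k - mean) / (std * math.sqrt(2))))


p = prob_position_uncovered(400e6, 40, 3e9)
print(f"{p:.5f}")
```

For the sample values the per-position probability comes out at about 0.0048. It is an intermediate quantity, and how it combines with the length $L$ to give the requested answer is left to the problem.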

Given: A positive number of short reads $N$ (in millions), an integer length $r$ of each read, and a length $L$ of the genome in Mbp (megabase pairs).

Return: The probability $p$, with $0 \le p \le 1$, that the given length $L$ of the genome is not covered by any reads. $p$ should be rounded to 5 decimal places.

Sample Dataset

400 40 7.45

Sample Output

0.00204