Expected Number of Restriction Sites solved by 716

Dec. 4, 2012, 7:07 a.m. by tvinar

Topics: Probability

A Shot in the Dark

In “Locating Restriction Sites”, we first familiarized ourselves with restriction enzymes. Recall that these enzymes are used by bacteria to cut through both strands of viral DNA, thus disarming the virus: the viral DNA locations where these cuts are made are known as restriction sites. Recall also that every restriction enzyme is preprogrammed with a reverse palindromic interval of DNA to which it will bind and cut, called a recognition sequence. These even length intervals are usually either 4 or 6 base pairs long, although longer ones do exist; rare-cutter enzymes have recognition sequences of 8 or more base pairs.

In this problem, we will ask a simple question: how does the bacterium "know" that it will probably succeed in finding a restriction site within the virus's DNA? The answer is that the short length of recognition sequences guarantees a large number of matches occurring randomly.

Intuitively, we would expect for a recognition sequence of length 6 to occur on average once every $4^6 = 4,096$ base pairs. Note that this fact does not imply that the associated restriction enzyme will cut the viral DNA every 4,096 bp; it may find two restriction sites close together, then not find a restriction site for many thousand nucleotides.

In this problem, we will generalize the problem of finding an average number of restriction sites to take into account the GC-content of the underlying string being analyzed.


Say that you place a number of bets on your favorite sports teams. If their chances of winning are 0.3, 0.8, and 0.6, then you should expect on average to win 0.3 + 0.8 + 0.6 = 1.7 of your bets (of course, you can never win exactly 1.7!)

More generally, if we have a collection of events $A_1, A_2, \ldots, A_n$, then the expected number of events occurring is $\mathrm{Pr}(A_1) + \mathrm{Pr}(A_2) + \cdots + \mathrm{Pr}(A_n)$ (consult the note following the problem for a precise explanation of this fact). In this problem, we extend the idea of finding an expected number of events to finding the expected number of times that a given string occurs as a substring of a random string.

Given: A positive integer $n$ ($n \leq 1,000,000$), a DNA string $s$ of even length at most 10, and an array $A$ of length at most 20, containing numbers between 0 and 1.

Return: An array $B$ having the same length as $A$ in which $B[i]$ represents the expected number of times that $s$ will appear as a substring of a random DNA string $t$ of length $n$, where $t$ is formed with GC-content $A[i]$ (see “Introduction to Random Strings”).

Sample Dataset

0.25 0.5 0.75

Sample Output

0.422 0.563 0.422

The Mathematical Details

In this problem, we are speaking of an expected number of events; how can we tie this into the definition of expected value that we already have from “Calculating Expected Offspring”?

The answer relies on a slick mathematical trick. For any event $A$, we can form a random variable for $A$, called an indicator random variable $I_{A}$. For an outcome $x$, $I_{A}(x) = 1$ when $x$ belongs to $A$ and $I_{A}(x) = 0$ when $x$ belongs to $A^{\textrm{c}}$.

For an indicator random variable $I_{A}(x) = 1$, verify that $\mathrm{E}(I_{A}) = \mathrm{Pr}(A)$.

You should also verify from our original formula for expected value that for any two random variables $X$ and $Y$, $\mathrm{E}(X + Y)$ is equal to $\mathrm{E}(X) + \mathrm{E}(Y)$. As a result, the expected number of events $A_1, A_2, \ldots, A_m$ occurring, or $\mathrm{E}(I_{A_1} + I_{A_2} + \cdots + I_{A_m})$, reduces to $\mathrm{Pr}(A_1) + \mathrm{Pr}(A_2) + \cdots + \mathrm{Pr}(A_m)$.

Please login to solve this problem.