Dec. 4, 2012, 7:06 a.m. by Rosalind Team
More Random Strings
In “Introduction to Random Strings”, we discussed searching for motifs in large genomes, in which random occurrences of the motif are possible. Our aim is to quantify just how frequently random motifs occur.
One class of motifs of interest is promoters, or regions of DNA that initiate the transcription of a gene. A promoter is usually located shortly before the start of its gene, and it contains specific intervals of DNA that provide an initial binding site for RNA polymerase to initiate transcription. Finding a promoter is usually the second step in gene prediction after establishing the presence of an ORF (see “Open Reading Frames”).
Unfortunately, there is no quick rule for identifying promoters. In Escherichia coli, the promoter contains two short intervals (TATAAT and TTGACA), which are respectively located 10 and 35 base pairs upstream from the beginning of the gene's ORF. Yet even these two short intervals are consensus strings (see “Consensus and Profile”): they represent average-case strings that are not found intact in most promoters. Bacterial promoters further vary in that some contain additional intervals used to bind to specific proteins or to change the intensity of transcription.
Eukaryotic promoters are even more difficult to characterize. Most have a TATA box (consensus sequence: TATAAA), preceded by an interval called a B recognition element, or BRE. These elements are typically located within 40 bp of the start of transcription. In addition, eukaryotic promoters can hold a larger number of additional "regulatory" intervals, which can be found as far as several thousand base pairs upstream of the gene.
Our aim in this problem is to determine the probability with which a given motif (a known promoter, say) occurs in a randomly constructed genome. Finding this probability directly is tricky, so instead of forming one long genome, we will form a large collection of smaller random strings, each having the same length as the motif. These smaller strings represent the genome's substrings, which we can then test against our motif.
Given a probabilistic event
For a simple example, if
Given: A positive integer
Return: The probability that if
90000 0.6 ATAGCCGA
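The sample line above can be worked through with a short sketch. It rests on assumptions that the text here does not spell out: that the three values are the number of random strings, the GC-content used to build them, and the motif, and that a random string is built symbol by symbol, with G and C each drawn with probability equal to half the GC-content and A and T each drawn with half the remainder. The function name `motif_match_probability` is hypothetical. The sketch finds the chance that a single random string equals the motif, then uses the complement (no string matches) to get the chance that at least one does.

```python
def motif_match_probability(n, gc_content, motif):
    """Probability that at least one of n random strings equals motif.

    Assumes each symbol is drawn independently: G or C with probability
    gc_content / 2 each, A or T with probability (1 - gc_content) / 2 each.
    """
    p_match = 1.0  # probability that one random string equals the motif
    for base in motif.upper():
        if base in "GC":
            p_match *= gc_content / 2
        elif base in "AT":
            p_match *= (1 - gc_content) / 2
        else:
            raise ValueError(f"unexpected symbol in motif: {base!r}")
    # Complement: 1 minus the probability that none of the n strings match.
    return 1 - (1 - p_match) ** n


# Assumed reading of the sample line: count, GC-content, motif.
n_text, gc_text, motif = "90000 0.6 ATAGCCGA".split()
print(round(motif_match_probability(int(n_text), float(gc_text), motif), 3))
```

Working through the complement avoids summing over every way that one, two, or more of the strings could match. Under the assumptions above, a single string matches ATAGCCGA with probability 0.2^4 × 0.3^4, and the sketch prints a value of about 0.689 for the sample line.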