We have seen that every genome contains a large number of repeats and noted that the Alu
repeat recurs around a million times on the human genome. Yet exactly how repetitive
is the human genome?
To frame such a vague question mathematically, we first need to make the observation that if the genome were
formed by adding nucleobases randomly, with each base having a 1/4 probability of being
added at each nucleotide position, then we should expect to see a huge number
of different substrings in the genome. Yet (to take a simple case) the genome containing
only adenosine and forming the DNA string "AAAAAA...AAA" has relatively very few
distinct substrings.
Now, real genomes are formed by a process that chooses nucleotides somewhere
in between these two extreme cases, and so to quantify just how random this process is,
we need to take the percentage of distinct substrings appearing in a genome with respect
to the maximum possible number of distinct substrings that could appear in a genome of the same
length.
Problem
Given a length n string s formed over an alphabetA of size a, let
the "substring count" sub(s) denote the total number of distinct substrings of s.
Furthermore, let the "maximum substring count" m(a,n) denote the maximum number of distinct substrings
that could appear in a string of length n formed over A.
The linguistic complexity of s (written lc(s)) is equal to sub(s)m(a,n);
in other words, lc(s) represents the percentage of observed substrings of s to
the total number that are theoretically possible. Note that 0<lc(s)<1, with smaller
values of lc(s) indicating that s is more repetitive.
As an example, consider the DNA string (a=4) s=ATTTGGATT. In the following
table, we demonstrate that lc(s)=3540=0.875 by considering
the number of observed and possible length k substrings of s, which are denoted by
subk(s) and m(a,k,n), respectively.
(Observe that m(a,n)=∑nk=1m(a,k,n)=40 and sub(s)=∑nk=1subk(s)=35.)