The linguistic complexity of a string s of length n formed over an alphabet of size a
(denoted lc(s)) is equal to the total number of distinct substrings
appearing in s (denoted sub(s)) divided by the maximum substring count
(denoted m(a,n)); the maximum substring count is the
total number of distinct substrings that could theoretically appear in a string of length
n formed over an alphabet of size a.
Note that we have the bounds 0<lc(s)≤1, with smaller values of lc(s) indicating that s is more repetitive.
As an example, consider the DNA string (alphabet size a=4) given by s=ATTTGGATT.
In the following table, we demonstrate that $\mathrm{lc}(s) = 35/40 = 0.875$ by considering
the number of observed and possible length k substrings of s for each k, which are denoted by
$\mathrm{sub}_k(s)$ and $m(a,k,n)$, respectively. Accordingly, $m(a,n) = \sum_{k=1}^{n} m(a,k,n) = 40$
and $\mathrm{sub}(s) = \sum_{k=1}^{n} \mathrm{sub}_k(s) = 35$.
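The per-length counts for this example can be checked with a short brute-force sketch. It assumes the usual bound on the possible count: a string of length n has n-k+1 windows of length k, and only a^k distinct words of length k exist, so m(a,k,n) is taken to be min(a^k, n-k+1). That formula is not stated in the text above, and the function names below are illustrative only. The approach enumerates every substring, which is fine for short strings but is not meant for genome-scale input.

```python
def observed_counts(s: str) -> list[int]:
    """Number of distinct length-k substrings of s, for k = 1..len(s)."""
    n = len(s)
    return [len({s[i:i + k] for i in range(n - k + 1)}) for k in range(1, n + 1)]


def possible_counts(a: int, n: int) -> list[int]:
    """Assumed maximum number of distinct length-k substrings, for k = 1..n.

    Assumption: m(a, k, n) = min(a**k, n - k + 1).
    """
    return [min(a ** k, n - k + 1) for k in range(1, n + 1)]


def linguistic_complexity(s: str, a: int) -> float:
    """lc(s) = sub(s) / m(a, n) for a non-empty string s over an alphabet of size a."""
    if not s:
        raise ValueError("s must be non-empty")
    return sum(observed_counts(s)) / sum(possible_counts(a, len(s)))


s = "ATTTGGATT"
print(observed_counts(s))           # [3, 5, 6, 6, 5, 4, 3, 2, 1], total 35
print(possible_counts(4, len(s)))   # [4, 8, 7, 6, 5, 4, 3, 2, 1], total 40
print(linguistic_complexity(s, 4))  # 0.875
```

For ATTTGGATT, the observed counts fall short of the possible counts only at k = 1, 2, and 3, because C never appears and the repeated ATT and TT limit the distinct short words.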