Processing math: 100%

Glossary

Linguistic complexity

The linguistic complexity of a string s of length n formed over an alphabet of size a (denoted lc(s)) is equal to the total number of distinct substrings appearing in s (denoted sub(s)) divided by the maximum substring count (denoted m(a,n)); the maximum substring count is the total number of distinct substrings that could theoretically appear in a string of length n formed over an alphabet of size a.

Note that we have the bounds 0<lc(s)1, with smaller values of lc(s) indicating that s is more repetitive.

As an example, consider the DNA string (alphabet size a=4) given by s=ATTTGGATT. In the following table, we demonstrate that lc(s)=3540=0.875 by considering the number of observed and possible length k substrings of s for each k, which are denoted by subk(s) and m(a,k,n), respectively. Accordingly, m(a,n)=nk=1m(a,k,n)=35 and sub(s)=nk=1subk(s)=40.

k subk(s) m(a,k,n)
1 3 4
2 5 8
3 6 7
4 6 6
5 5 5
6 4 4
7 3 3
8 2 2
9 1 1
Total 35 40

Wikipedia