Glossary

Linguistic complexity

The linguistic complexity of a string $s$ of length $n$ formed over an alphabet of size $a$ (denoted $\textrm{lc}(s)$) is equal to the total number of distinct substrings appearing in $s$ (denoted $\textrm{sub}(s)$) divided by the maximum substring count (denoted $m(a, n)$); the maximum substring count is the total number of distinct substrings that could theoretically appear in a string of length $n$ formed over an alphabet of size $a$.

Note that we have the bounds $0 < \textrm{lc}(s) \leq 1$, with smaller values of $\textrm{lc}(s)$ indicating that $s$ is more repetitive.

As an example, consider the DNA string (alphabet size $a = 4$) given by $s = \textrm{ATTTGGATT}$. In the following table, we demonstrate that $\textrm{lc}(s) = \frac{35}{40} = 0.875$ by considering the number of observed and possible length $k$ substrings of $s$ for each $k$, which are denoted by $\textrm{sub}_{k}(s)$ and $m(a, k, n)$, respectively. Accordingly, $m(a, n) = \sum_{k=1}^{n}{m(a,k,n)} = 35$ and $\textrm{sub}(s)= \sum_{k=1}^{n}\textrm{sub}_{k}(s) = 40$.

$k$ $\textrm{sub}_{k}(s)$ $m(a, k, n)$
1 3 4
2 5 8
3 6 7
4 6 6
5 5 5
6 4 4
7 3 3
8 2 2
9 1 1
Total 35 40

Wikipedia