# Glossary

## Linguistic complexity

The linguistic complexity of a string $s$ of length $n$ formed over an alphabet of size $a$ (denoted $\textrm{lc}(s)$) is equal to the total number of distinct substrings appearing in $s$ (denoted $\textrm{sub}(s)$) divided by the maximum substring count (denoted $m(a, n)$); the maximum substring count is the total number of distinct substrings that could theoretically appear in a string of length $n$ formed over an alphabet of size $a$.

Note that we have the bounds $0 < \textrm{lc}(s) \leq 1$, with smaller values of $\textrm{lc}(s)$ indicating that $s$ is more repetitive.

As an example, consider the DNA string (alphabet size $a = 4$) given by $s = \textrm{ATTTGGATT}$. In the following table, we demonstrate that $\textrm{lc}(s) = \frac{35}{40} = 0.875$ by considering the number of observed and possible length $k$ substrings of $s$ for each $k$, which are denoted by $\textrm{sub}_{k}(s)$ and $m(a, k, n)$, respectively. Accordingly, $m(a, n) = \sum_{k=1}^{n}{m(a,k,n)} = 35$ and $\textrm{sub}(s)= \sum_{k=1}^{n}\textrm{sub}_{k}(s) = 40$.

 $k$ $\textrm{sub}_{k}(s)$ $m(a, k, n)$ 1 3 4 2 5 8 3 6 7 4 6 6 5 5 5 6 4 4 7 3 3 8 2 2 9 1 1 Total 35 40