ROSALIND | Glossary | Linguistic complexity

The linguistic complexity of a string $s$ of length $n$ formed over an alphabet of size $a$, denoted $\textrm{lc}(s)$, is the total number of distinct substrings appearing in $s$, denoted $\textrm{sub}(s)$, divided by the maximum substring count, denoted $m(a, n)$. The maximum substring count is the total number of distinct substrings that could theoretically appear in a string of length $n$ formed over an alphabet of size $a$. That is, $\textrm{lc}(s) = \textrm{sub}(s) / m(a, n)$.

Note that we have the bounds $0 < \textrm{lc}(s) \leq 1$, with smaller values of $\textrm{lc}(s)$ indicating that $s$ is more repetitive.

As an example, consider the DNA string (alphabet size $a = 4$) given by $s = \textrm{ATTTGGATT}$, which has length $n = 9$. In the following table, we demonstrate that $\textrm{lc}(s) = \frac{35}{40} = 0.875$ by considering the number of observed and possible length $k$ substrings of $s$ for each $k$, which are denoted by $\textrm{sub}_{k}(s)$ and $m(a, k, n)$, respectively. Here $m(a, k, n) = \min(a^k, n - k + 1)$, because there are only $a^k$ different strings of length $k$ and only $n - k + 1$ positions at which a length $k$ substring can start. Accordingly, $\textrm{sub}(s) = \sum_{k=1}^{n}\textrm{sub}_{k}(s) = 35$ and $m(a, n) = \sum_{k=1}^{n}{m(a,k,n)} = 40$.

$k$	$\textrm{sub}_{k}(s)$	$m(a, k, n)$
1	3	4
2	5	8
3	6	7
4	6	6
5	5	5
6	4	4
7	3	3
8	2	2
9	1	1
Total	35	40
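
The table can be reproduced with a short brute-force computation. The following Python sketch is illustrative only: the function names are hypothetical, and it assumes $m(a, k, n) = \min(a^k, n - k + 1)$, which agrees with every row of the table. It also assumes a non-empty string.

```python
def distinct_substrings(s: str, k: int) -> int:
    """Count the distinct length-k substrings of s, i.e. sub_k(s)."""
    return len({s[i:i + k] for i in range(len(s) - k + 1)})


def max_substrings(a: int, k: int, n: int) -> int:
    """Assumed form of m(a, k, n): at most a**k strings of length k exist,
    and a string of length n has only n - k + 1 windows of length k."""
    return min(a ** k, n - k + 1)


def linguistic_complexity(s: str, a: int) -> float:
    """Return lc(s) = sub(s) / m(a, n) for a non-empty string s."""
    n = len(s)
    observed = sum(distinct_substrings(s, k) for k in range(1, n + 1))
    possible = sum(max_substrings(a, k, n) for k in range(1, n + 1))
    return observed / possible


print(linguistic_complexity("ATTTGGATT", 4))  # 0.875
```

This approach stores every substring explicitly, so it is only practical for short strings; for long sequences a suffix tree or suffix array would be a more suitable way to count distinct substrings.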
