Figure 1. The 2-mer composition of TTGATTACCTTATTTGATCATTACACATTGTACGCTTGTGTCAAAATATCACATGTGCCT
A length ksubstring of a genetic string is commonly called a k-mer.
A genetic string of length n can be seen as composed of n−k+1 overlapping k-mers.
The k-mer composition of a genetic string encodes
the number of times that each possible k-mer occurs in the string.
See Figure 1.
The 1-mer composition
is a generalization of the GC-content of a strand of DNA, and the 2-mer, 3-mer, and 4-mer compositions
of a DNA string are also commonly known as its di-nucleotide, tri-nucleotide, and tetra-nucleotide compositions.
The biological significance of k-mer composition is manyfold. GC-content is helpful not only
in helping to identify a piece of unknown DNA (see “Computing GC Content”), but also because a genomic
region having high GC-content compared to the rest of the genome signals that it may
belong to an exon. Analyzing k-mer composition is vital to fragment assembly as well.
In “Computing GC Content”, we also drew an analogy between
analyzing the frequency of characters and identifying the underlying language. For larger
values of k, the k-mer composition offers a more robust fingerprint of a string's identity because it
offers an analysis on the scale of substrings (i.e., words) instead of that of single symbols.
As a basis of comparison, in language analysis, the k-mer composition of a text can be
used not only to pin down the language, but also often the author.
Problem
For a fixed positive integer k, order all possible k-mers taken from an underlying alphabet lexicographically.
Then the k-mer composition of a string s can be represented by
an arrayA for which A[m] denotes the number of times that the mth
k-mer (with respect to the lexicographic order) appears in s.