In “Genome Assembly as Shortest Superstring”, we introduce the problem of assembling a genome from a collection
of reads. Even though genome sequencing is a multi-billion dollar enterprise,
sequencing machines that identify reads still produce errors a substantial percentage of
the time. To make matters worse, these errors are unpredictable; it is difficult to
determine if the machine has made an error, let alone where in the read the error has
occurred. For this reason, error correction in reads is typically a vital first step in
genome assembly.
Problem
As is the case with point mutations, the most common type of sequencing error occurs
when a single nucleotide from a read is interpreted incorrectly.
Given: A collection of up to 1000 reads of equal length (at most 50 bp) in FASTA format.
Some of these reads were generated with a single-nucleotide error.
For each read s in the dataset, one of the following applies:
s was correctly sequenced and appears in the dataset at least twice (possibly as a reverse complement);
s is incorrect, it appears in the dataset exactly once, and its Hamming distance is 1 with respect to
exactly one correct read in the dataset (or its reverse complement).
Return: A list of all corrections in the form "[old read]->[new read]". (Each correction
must be a single symbol substitution, and you may return the corrections in any order.)