Figure 1. Fish samples waiting for analysis in the lab of US Customs and Border Protection. (Copyright: Getty Images)
Figure 2. First 90 positions of a protein multiple sequence alignment of instances of the acidic ribosomal protein P0 from several organisms.
Recall problems from the ABC books where you have to find out the odd object. When you're dealing with nucleotide
strings you can face with the same problem - for example, if you need to detect the foreign admixture among the
samples. Such analysis nowadays is preformed in the customs of the many countries in order to
find the smuggling or fake food (passing tilapia off as salmon etc., see Figure 1)
In order to compare several homologous strings we need to align all of them simultaneously,
a procedure known as multiple sequence alignment, or MSA. Because it requires us to
compare more than two sequences at once, MSA is a more complicated problem than pairwise alignment.
In fact, finding a optimal alignment between more than a very few sequences is
so computationally intensive that many MSA programs rely instead on "quick and dirty" heuristic methods
that are guaranteed to provide a "good" solution but not necessarily the best possible one.
See Figure 2
for an example of a multiple alignment output.
Actually MSA programs based on the series of the pairwise alignments, and optimizes sum some score over all pairs of
characters in each position.
Problem
One of the first and commonly used programs for MSA is Clustal, developed by Des Higgins in 1988.
The current version using the same approach is called ClustalW2, and it is embedded in many software packages.
There is even a modification of ClustalW2 called ClustalX that provides a graphical user interface
for MSA.
See the link below for a convenient online interface that runs Clustal on the EBI website:
Select "Protein" or "DNA", then either paste your sequence in one of the listed formats or upload an entire file.
To obtain a more accurate alignment, leave Alignment type: slow selected: if you choose
to run Clustal on only two sequences, then the parameter options correspond to those
in Needle (see “Pairwise Global Alignment”).
Given: Set of nucleotide strings in FASTA format.
Return: ID of the string most different from the others.
Do a pairwise alignment. Program takes every pair of strings in the given set and finds the optimal global alignment
for the pair constructing the distance matrix.
Create a "guide tree". Program builds the bifurcating tree using distance matrix - it takes the closest pair,
adds the next closest string to that pair as a neighbour, and so on.
Use the guide tree to carry out a multiple alignment. Strings aligned progressively according to the hierarchy
in the guide tree.
You can download the CLustalW2 program from its homepage and run it
via the command line or using graphical interface ClustalX.
You can use installed ClustalW directly or via some wrappers: