Unfortunately, constructing a phylogeny from the ground up based only on an alignment can
be difficult. In order to produce an efficient solution, we will need to assume
that the structure of the phylogeny has already been provided (perhaps from character-based methods),
and our aim instead is to reconstruct the genetic strings corresponding to the internal nodes
(i.e., ancestors) in the tree.
The ancestor strings should have the property that the total
number of point mutations separating adjacent nodes in the tree is minimized
(in keeping with parsimony).
Problem
Say that we have ntaxa represented by stringss1,s2,…,sn with
a multiple alignment inducing corresponding augmented strings¯s1,¯s2,…,¯sn.
Recall that the number of single-symbol substitutions required to transform one string into
another is the Hamming distance between the strings (see “Counting Point Mutations”). Say that we have a rooted binary treeT
containing ¯s1,¯s2,…,¯sn at its leaves and additional strings
¯sn+1,¯sn+2,…,¯s2n−1 at its internal nodes,
including the root (the number of internal nodes is n−1 by extension of “Counting Phylogenetic Ancestors”).
Define dH(T) as the sum of dH(¯si,¯sj) over
all edges{¯si,¯sj} in T:
dH(T)=∑{¯si,¯sj}∈E(T)dH(¯si,¯sj)
Thus, our aim is to minimize dH(T).
Given: A rooted binary tree T on n (n≤500) species, given in Newick format, followed by
a multiple alignment of m (m≤n) augmented DNA strings having the same length (at most 300 bp)
corresponding to the species and given in FASTA format.
Return: The minimum possible value of dH(T), followed by a collection of DNA strings
to be assigned to the internal nodes of T that will minimize dH(T)
(multiple solutions will exist, but you need only output one).
Sample Dataset
(((ostrich,cat)rat,(duck,fly)mouse)dog,(elephant,pikachu)hamster)robot;
>ostrich
AC
>cat
CA
>duck
T-
>fly
GC
>elephant
-T
>pikachu
AA
Sample Output
8
>rat
AC
>mouse
TC
>dog
AC
>hamster
AT
>robot
AC
Noteclick to expand
Given internal strings minimizing dH(T), the alignment between any two
adjacent strings is not necessarily an optimal global paired alignment. In other words,
it may not be the case that dH(¯si,¯sj) is equal to
the edit distancedE(si,sj).