Introduction to Character-Based Phylogenyclick to expand
Before the modern genetics revolution, phylogenies were constructed from physical characters
resulting from direct structural comparison of taxa. A great deal of analysis relied on the
fossil record, as fossils provided the only concrete framework for studying the appearance
of extinct species and for inferring how they could have evolved into present-day organisms.
A classic case illustrating the utility of the fossil record
is the case of dinosaur pelvic bones. In 1887, Harry Seeley proposed a new
classification of dinosaurs into two orders, Saurischia and Ornithischia: the former
possessed hip bones shaped like those found in reptiles, whereas the latter had a much different
hip shape that resembled birds. Seeley's pelvic classification has survived to
the present day as the principal division of dinosaurs.
The key point is that hip bone shape is a physical character that separates all
dinosaurs into two different groups. Our hope is to construct a phylogeny solely from a
collection of characters. Throughout character-based phylogeny, our two-part assumption is that
all taxa possessing a character must have evolved from a single ancestor that introduced this character,
and conversely, any taxon not possessing the character cannot be descended from this ancestor.
Problem
Given a collection of n taxa, any subsetS of these taxa can be seen as encoding a character
that divides the taxa into the sets S and S^{\textrm{c}};
we can represent the character by S \mid S^{\textrm{c}}, which is called a split.
Alternately, the character can be represented by a character arrayA of length n for which A[j] = 1 if the jth
taxon belongs to S and A[j] = 0 if the jth taxon belongs to S^{\textrm{c}} (recall the "ON"/"OFF" analogy from “Counting Subsets”).
At the same time, observe that the removal of an edge from an unrooted binary tree produces two
separate trees, each one containing a subset of the original taxa. So each edge may also be
encoded by a split S \mid S^{\textrm{c}}.
A trivial character isolates a single taxon into a group of its own. The corresponding split
S \mid S^{\textrm{c}} must be such that S or S^{\textrm{c}} contains only one element;
the edge encoded by this split must be incident to a leaf of the unrooted binary tree, and
the array for the character contains exactly one 0 or exactly one 1. Trivial characters
are of no phylogenetic interest because they fail to provide us with information regarding
the relationships of taxa to each other. All other characters are called nontrivial characters
(and the associated splits are called nontrivial splits).
A character table is a matrix C in which each row represents the array notation for a nontrivial character.
That is, entry C_{i,j} denotes the "ON"/"OFF" position
of the ith character with respect to the jth taxon.
Given: An unrooted binary tree T in Newick format for at most 200 species taxa.
Return: A character table having the same splits as the edge splits of T. The columns of the character table
should encode the taxa ordered lexicographically; the rows of the character table may be given
in any order. Also, for any given character, the particular subset of taxa to which 1s are
assigned is arbitrary.