Of course, in practice, the problem operates in reverse. We first need to generate a character table
before we can model a phylogeny on this table. In modern genetics, a reliable way to
obtain a large number of characters is by using SNPs. As mentioned in “Counting Subsets”, for a given SNP, we can
divide taxa into two sets depending on which of two bases is present at the nucleotide,
thus defining the split of a character.
Problem
A collection of strings is characterizable if there are at most two possible choices for the symbol
at each position of the strings.
Given: A collection of at most 100 characterizable DNA strings, each of length at most 300 bp.
Return: A character table for which each nontrivial character encodes the symbol choice at
a single position of the strings. (Note: the choice of assigning '1' and '0'
to the two states of each SNP in the strings is arbitrary.)
Sample Dataset
ATGCTACC
CGTTTACC
ATTCGACC
AGTCTCCC
CGTCTATC
Sample Output
10110
10100
Noteclick to expand
Recall that the character table does not encode trivial characters.