HierarchicalClustering, whose pseudocode is shown below, progressively generates n different partitions of the underlying data into clusters, all represented by a tree in which each node is labeled by a cluster of genes. The first partition has n single-element clusters represented by the leaves of the tree, with each element forming its own cluster. The second partition merges the two “closest” clusters into a single cluster consisting of two elements. In general, the i-th partition merges the two closest clusters from the (i - 1)-th partition and has n - i + 1 clusters. We hope this algorithm looks familiar — it is UPGMA (from “Implement UPGMA”) in disguise.

HierarchicalClustering(D, n) Clusters ← n single-element clusters labeled 1, ... , n
construct a graph T with n isolated nodes labeled by single elements 1, ... , n while there is more than one cluster
find the two closest clusters C_{i} and C_{j}
merge C_{i} and C_{j} into a new cluster C_{new} with |C_{i}| + |C_{j}| elements
add a new node labeled by cluster C_{new} to T
connect node C_{new} to C_{i} and C_{j} by directed edges
remove the rows and columns of D corresponding to C_{i} and C_{j}
remove C_{i} and C_{j} from Clusters
add a row/column to D for C_{new} by computing D(C_{new}, C) for each C in Clusters
add C_{new} to Clusters
assign root in T as a node with no incoming edges returnT

Note that we have not yet defined how HierarchicalClustering computes the distance D(C_{new}, C) between a newly formed cluster C_{new} and each old cluster C. In practice, clustering algorithms vary in how they compute these distances, with results that can vary greatly. One commonly used approach defines the distance between clusters C_{1 and C2 as the smallest distance between any pair of elements from these clusters,}

D_{min}(C_{1},C_{2}) = min_{all points i in cluster C1, all points j in cluster C2}D_{i, j }.

The distance function that we encountered with UPGMA uses the average distance between elements in two clusters,

$$D_\text{avg}(C_1, C_2) = \dfrac{\sum_{\text{all points }i\text{ in cluster }C_1} ~\sum_{\text{all points }j\text{ in cluster }C_2} D_{i,j}}{|C_1| \cdot |C_2|}$$

Implement Hierarchical Clustering

Given: An integer n, followed by an nxn distance matrix.

Return: The result of applying HierarchicalClustering to this distance matrix (using D_{avg}), with each newly created cluster listed on each line.