Implement Hierarchical Clustering solved by 216

Sept. 13, 2015, 10:41 p.m. by Rosalind Team

HierarchicalClustering, whose pseudocode is shown below, progressively generates n different partitions of the underlying data into clusters, all represented by a tree in which each node is labeled by a cluster of genes. The first partition has n single-element clusters represented by the leaves of the tree, with each element forming its own cluster. The second partition merges the two “closest” clusters into a single cluster consisting of two elements. In general, the i-th partition merges the two closest clusters from the (i - 1)-th partition and has n - i + 1 clusters. We hope this algorithm looks familiar — it is UPGMA (from “Implement UPGMA”) in disguise.

HierarchicalClustering(D, n)
    Clusters ← n single-element clusters labeled 1, ... , n
    construct a graph T with n isolated nodes labeled by single elements 1, ... , n
    while there is more than one cluster
        find the two closest clusters C_i and C_j
        merge C_i and C_j into a new cluster C_new with |C_i| + |C_j| elements
        add a new node labeled by cluster C_new to T
        connect node C_new to C_i and C_j by directed edges
        remove the rows and columns of D corresponding to C_i and C_j
        remove C_i and C_j from Clusters
        add a row/column to D for C_new by computing D(C_new, C) for each C in Clusters
        add C_new to Clusters
    assign root in T as a node with no incoming edges
    return T
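
The pseudocode above can be sketched in Python roughly as follows. This is a minimal illustration, not a reference implementation: the function name `hierarchical_clustering`, the representation of clusters as tuples of labels, and the caller-supplied `cluster_distance` function are all assumptions made for the example. For brevity it recomputes distances on every pass instead of maintaining rows and columns of D.

```python
from itertools import combinations


def hierarchical_clustering(n, cluster_distance):
    """Return (tree, root) for elements labeled 1..n.

    cluster_distance(a, b) is a hypothetical caller-supplied function that
    takes two clusters (tuples of labels) and returns their distance.
    tree maps every node (a cluster) to the list of its children.
    """
    clusters = [(i,) for i in range(1, n + 1)]
    tree = {c: [] for c in clusters}  # n isolated nodes

    while len(clusters) > 1:
        # Find the two closest clusters C_i and C_j.
        c_i, c_j = min(combinations(clusters, 2),
                       key=lambda pair: cluster_distance(*pair))
        c_new = c_i + c_j
        # Add node C_new with directed edges to C_i and C_j.
        tree[c_new] = [c_i, c_j]
        # Replace C_i and C_j by C_new in the list of clusters.
        clusters = [c for c in clusters if c not in (c_i, c_j)]
        clusters.append(c_new)

    root = clusters[0]  # the only node with no incoming edges
    return tree, root
```

Passing the distance as a function keeps the loop independent of how D(C_new, C) is defined, which the text addresses next.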

Note that we have not yet defined how HierarchicalClustering computes the distance D(C_new, C) between a newly formed cluster C_new and each old cluster C. In practice, clustering algorithms vary in how they compute these distances, with results that can vary greatly. One commonly used approach defines the distance between clusters C_1 and C_2 as the smallest distance between any pair of elements from these clusters,

$$D_\text{min}(C_1, C_2) = \min_{\text{all points }i\text{ in cluster }C_1,~\text{all points }j\text{ in cluster }C_2} D_{i,j}$$

The distance function that we encountered with UPGMA uses the average distance between elements in two clusters,

$$D_\text{avg}(C_1, C_2) = \dfrac{\sum_{\text{all points }i\text{ in cluster }C_1} ~\sum_{\text{all points }j\text{ in cluster }C_2} D_{i,j}}{|C_1| \cdot |C_2|}$$
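
To make the two definitions concrete, here is a small Python sketch. The function names `d_min` and `d_avg`, the 0-based indexing, and the 3 x 3 matrix are assumptions chosen for illustration; they are not part of the problem statement.

```python
def d_min(D, c1, c2):
    """Smallest distance between any element of c1 and any element of c2."""
    return min(D[i][j] for i in c1 for j in c2)


def d_avg(D, c1, c2):
    """Average distance over all pairs (i, j) with i in c1 and j in c2."""
    total = sum(D[i][j] for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


# Illustrative distance matrix on three elements (0-based indices).
D = [[0.0, 1.0, 3.0],
     [1.0, 0.0, 5.0],
     [3.0, 5.0, 0.0]]

print(d_min(D, [0, 1], [2]))  # 3.0
print(d_avg(D, [0, 1], [2]))  # 4.0
```

Even on this tiny matrix the two functions disagree, which is why the choice of distance can change which clusters are merged.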

Implement Hierarchical Clustering

Given: An integer n, followed by an n × n distance matrix.

Return: The result of applying HierarchicalClustering to this distance matrix (using D_avg), with each newly created cluster listed on its own line.

Sample Dataset

7
0.00 0.74 0.85 0.54 0.83 0.92 0.89
0.74 0.00 1.59 1.35 1.20 1.48 1.55
0.85 1.59 0.00 0.63 1.13 0.69 0.73
0.54 1.35 0.63 0.00 0.66 0.43 0.88
0.83 1.20 1.13 0.66 0.00 0.72 0.55
0.92 1.48 0.69 0.43 0.72 0.00 0.80
0.89 1.55 0.73 0.88 0.55 0.80 0.00

Sample Output

4 6
5 7
3 4 6
1 2
5 7 3 4 6
1 2 5 7 3 4 6
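
One way to reproduce the sample is sketched below, under stated assumptions. The function name `solve` and the input handling are illustrative. Instead of recomputing D_avg from the original matrix, the sketch updates the row for a merged cluster with a size-weighted average of the two merged rows, which is equivalent to the D_avg definition. New clusters are appended to the end of the list and the earlier cluster of a merged pair is listed first; this ordering is an assumption that happens to match the sample output.

```python
def solve(text):
    """Return the merged clusters, one string per merge, using D_avg."""
    tokens = text.split()
    n = int(tokens[0])
    values = [float(t) for t in tokens[1:1 + n * n]]
    # dist[a][b] is the D_avg distance between clusters[a] and clusters[b].
    dist = [values[r * n:(r + 1) * n] for r in range(n)]
    clusters = [[i] for i in range(n)]  # 0-based element indices
    lines = []

    while len(clusters) > 1:
        m = len(clusters)
        # Closest pair of clusters, with i < j.
        i, j = min(((a, b) for a in range(m) for b in range(a + 1, m)),
                   key=lambda p: dist[p[0]][p[1]])
        c_i, c_j = clusters[i], clusters[j]
        merged = c_i + c_j
        lines.append(" ".join(str(e + 1) for e in merged))

        keep = [k for k in range(m) if k not in (i, j)]
        # Size-weighted average of the two removed rows gives D_avg(C_new, C).
        new_row = [(len(c_i) * dist[i][k] + len(c_j) * dist[j][k]) / len(merged)
                   for k in keep]
        # Drop rows/columns i and j, then append the row/column for C_new.
        dist = [[dist[a][b] for b in keep] + [new_row[x]]
                for x, a in enumerate(keep)]
        dist.append(new_row + [0.0])
        clusters = [clusters[k] for k in keep] + [merged]

    return lines


sample = """7
0.00 0.74 0.85 0.54 0.83 0.92 0.89
0.74 0.00 1.59 1.35 1.20 1.48 1.55
0.85 1.59 0.00 0.63 1.13 0.69 0.73
0.54 1.35 0.63 0.00 0.66 0.43 0.88
0.83 1.20 1.13 0.66 0.00 0.72 0.55
0.92 1.48 0.69 0.43 0.72 0.00 0.80
0.89 1.55 0.73 0.88 0.55 0.80 0.00
"""
print("\n".join(solve(sample)))
```

For example, after 4 and 6 are merged at distance 0.43, the distance from the new cluster to element 3 becomes (0.63 + 0.69) / 2 = 0.66, which is why "3 4 6" is formed two steps later.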
