ROSALIND | Compute the Squared Error Distortion

FarthestFirstTraversal (which we introduced in “Implement FarthestFirstTraversal”) is fast, and its solution approximates the optimal solution of the k-Center Clustering Problem; however, this algorithm is rarely used for gene expression analysis. In k-Center Clustering, we selected Centers so that these points would minimize MaxDistance(Data, Centers), the maximum distance between any point in Data and its nearest center. But biologists are usually interested in analyzing typical rather than maximum deviations, since the latter may correspond to outliers representing experimental errors.

To address limitations of MaxDistance, we will introduce a new scoring function. Given a set Data of n data points and a set Centers of k centers, the squared error distortion of Data and Centers, denoted Distortion(Data, Centers), is defined as the mean squared distance from each data point to its nearest center,

Distortion(Data,Centers) = (1/n) ∑_{all points DataPoint in Data}d(DataPoint, Centers)².

Squared Error Distortion Problem

Given: Integers k and m, followed by a set of centers Centers and a set of points Data.

Return: The squared error distortion Distortion(Data, Centers).

Sample Dataset

2 2
2.31 4.55
5.96 9.08
--------
3.42 6.03
6.23 8.25
4.76 1.64
4.47 4.33
3.95 7.61
8.93 2.97
9.74 4.03
1.73 1.28
9.72 5.01
7.27 3.77

Sample Output

18.246

Compute the Squared Error Distortion solved by 265

Squared Error Distortion Problem

Sample Dataset

Sample Output

Extra Dataset

Compute the Squared Error Distortion solved by 265

Squared Error Distortion Problem

Sample Dataset

Sample Output

Extra Dataset

Report a typo