Sept. 13, 2015, 10:41 p.m. by Rosalind Team

To think about clustering as dividing data points *Data* into *k* clusters, we will try to select a set *Centers* of *k* points that will serve as the **centers** of these clusters. We would like to choose *Centers* so that they minimize some distance function between *Centers* and *Data* over all possible choices of centers. But how should this distance function be defined?

First, we define the **Euclidean distance** between points *v* = (*v*_{1}, ... , *v*_{m}) and *w *= (*w*_{1}, ... , *w*_{m}) in m-dimensional space, denoted *d*(*v*, *w*), as the length of the line segment connecting these points,

Next, given a point *DataPoint* in multi-dimensional space and a set of *k* points Centers, we define the distance from *DataPoint* to *Centers*, denoted *d*(*DataPoint*, *Centers*), as the Euclidean distance from *DataPoint* to its closest center,

*d*(*DataPoint*,*Centers*) = min_{all points x from Centers}*d*(*DataPoint*, *x*).

The length of the segments in the figure below correspond to *d*(*DataPoint*, *Centers*) for each point *DataPoint*.

We now define the distance between all data points *Data* and centers *Centers*. This distance, denoted *MaxDistance*(*Data*, *Centers*), is the maximum of *d*(*DataPoint*, *Centers*) among all data points *DataPoint*,

*MaxDistance*(*Data*, *Centers*) = max_{all points DataPoint from Data}* d*(*DataPoints*,*Centers*).

We can now formulate a well-defined clustering problem.

*k***-Center Clustering Problem: ***Given a set of data points, find k centers minimizing the maximum distance between these data points and centers. *

over all possible choices of

Although the *k*-Center Clustering Problem is easy to state, it is *NP*-Hard. The **Farthest First Traversal** heuristic, whose pseudocode is shown below, selects centers from the points in *Data* (instead of from all possible points in *m*-dimensional space). It begins by selecting an arbitrary point in *Data* as the first center and iteratively adds a new center as the point in *Data* that is farthest from the centers chosen so far, with ties broken arbitrarily.

add

Given: Integers *k *and *m *followed by a set of points *Data *in *m*-dimensional space.

Return: A set *Centers* consisting of *k* points (centers) resulting from applying **FarthestFirstTraversal**(*Data*, *k*), where the first point from *Data* is chosen as the first center to initialize the algorithm.

3 2 0.0 0.0 5.0 5.0 0.0 5.0 1.0 1.0 2.0 2.0 3.0 3.0 1.0 2.0

0.0 0.0 5.0 5.0 0.0 5.0