k-means clustering (naive)

k-means is a method to partition a set of n examples in k clusters Si in S around the nearest center in order to find the best assignment:

\underset{\mathbf{S}}{\arg\min} \sum_{i=1}^{k} \sum_{\mathbf x_j \in S_i} \left\| \mathbf x_j - \boldsymbol\mu_i \right\|^2

The problem is complex, so the standard k-means algorithm is based on a two steps heuristics:

  • Assignment step: assign each example at the cluster with the nearest center

  • Update step: update the center computing the mean on the newly assigned examples

The algorithm converges when the assignments no longer change.

The main loop of the algorithm is:

To test this naive implementation we can run it on the famous Iris dataset. In this demo we shuffle the dataset with a fixed seed for reproducibility reasons:

You can download the complete script here:

http://gist.github.com/457394

In general for most start choices the termination of the algorithm depends on the order of the examples. One can run it multiple times and chose the best clustering for some performance measure.