K means clustering algorithm

K-means clustering

input: K set of points $X_i \dots X_n$

place centroids $C_i, \dots, C_n$ at random locations

Repeat until convergence:

for each point $x_i$
- find nearest centroid $c_j$ (average min $D(x_i, c_j)$)
- assign the point $x_i$ to cluster $j$
for each cluster $j = 1 \dots k$: (recompute each cluster centroid position)
- new centroid $c_j = $ mean of all points $x_i$ assigned to cluster $j$ in previous step.

Example:

initial

Initially, the random placed centroids red and yellow triangles.

assign

Calculate distance from $x_i$ to each centroid, and assign them into either cluster.

Imgur

Recompute the centroids of clusters.

Imgur

Repeat the steps until there is no items re-assign to a different cluster.

Disadvantages:

Euclidean distance is used as a metric and variance is used as a measure of cluster scatter (Can not apply for categorical attributes).
The number of clusters $k$ is an input parameter: an inappropriate choice of $k$ may yield poor results.
Convergence to a local minimum may produce wrong results

Advantages: K-means is fast.

Published 11 June 2015