K-means clustering
input: K set of points $X_i \dots X_n$
place centroids $C_i, \dots, C_n$ at random locations
Repeat until convergence:
- for each point $x_i$
- find nearest centroid $c_j$ (average min $D(x_i, c_j)$)
- assign the point $x_i$ to cluster $j$
- for each cluster $j = 1 \dots k$: (recompute each cluster centroid position)
- new centroid $c_j = $ mean of all points $x_i$ assigned to cluster $j$ in previous step.
Example:
Initially, the random placed centroids red and yellow triangles.
Calculate distance from $x_i$ to each centroid, and assign them into either cluster.
Recompute the centroids of clusters.
Repeat the steps until there is no items re-assign to a different cluster.
Advantages and Disadvantages
Disadvantages:
- Euclidean distance is used as a metric and variance is used as a measure of cluster scatter (Can not apply for categorical attributes).
- The number of clusters $k$ is an input parameter: an inappropriate choice of $k$ may yield poor results.
- Convergence to a local minimum may produce wrong results
Advantages: K-means is fast.