K-means clustering
Input: $K$ (the number of clusters) and a set of points $x_1, \dots, x_n$.
Place centroids $c_1, \dots, c_K$ at random locations.
Repeat until convergence:
- for each point $x_i$
- find the nearest centroid $c_j$, i.e. $\arg\min_j D(x_i, c_j)$
- assign the point $x_i$ to cluster $j$
- for each cluster $j = 1 \dots K$ (recompute each centroid's position):
- new centroid $c_j = $ mean of all points $x_i$ assigned to cluster $j$ in previous step.
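The loop above translates almost directly into code. Below is a minimal NumPy sketch of this procedure; the function name `kmeans`, the random seed, and the choice to initialize centroids at random input points are my own assumptions, not part of the notes.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Cluster the rows of X into k groups (minimal sketch of the loop above)."""
    rng = np.random.default_rng(seed)
    # Place centroids at random locations: here, k distinct input points.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    assignments = None

    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assignments = dists.argmin(axis=1)

        # Converged when no point is re-assigned to a different cluster.
        if assignments is not None and np.array_equal(new_assignments, assignments):
            break
        assignments = new_assignments

        # Update step: each centroid moves to the mean of its assigned points.
        for j in range(k):
            members = X[assignments == j]
            if len(members) > 0:          # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)

    return centroids, assignments
```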
Example:

Initially, the randomly placed centroids are shown as red and yellow triangles.

Calculate the distance from each point $x_i$ to each centroid, and assign the point to the nearest cluster.

Recompute the centroid of each cluster.

Repeat these steps until no item is re-assigned to a different cluster.
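As a stand-in for the pictured example, the sketch above can be run on two synthetic, well-separated blobs (the data below is made up purely to illustrate the steps); the assignments stop changing after a couple of iterations.

```python
import numpy as np

rng = np.random.default_rng(1)
# Two synthetic blobs standing in for the red/yellow example clusters.
X = np.vstack([rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
               rng.normal(loc=[5, 5], scale=0.5, size=(50, 2))])

centroids, labels = kmeans(X, k=2)
print("centroids:\n", centroids)             # roughly [0, 0] and [5, 5]
print("cluster sizes:", np.bincount(labels)) # roughly 50 and 50
```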
Advantages and Disadvantages
Disadvantages:
- Euclidean distance is used as the metric and variance as the measure of cluster scatter, so it cannot be applied to categorical attributes.
- The number of clusters $k$ is an input parameter: an inappropriate choice of $k$ may yield poor results.
- Convergence to a local minimum may produce poor results (a common mitigation is sketched at the end of this section).
Advantages:
- K-means is fast.
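Because a single run can get stuck in a poor local minimum, one common mitigation (not described in the notes above, but standard practice) is to restart from several random initializations and keep the clustering with the smallest within-cluster sum of squares. The helper below is a sketch built on the earlier `kmeans` function.

```python
def kmeans_restarts(X, k, n_init=10):
    """Run the kmeans sketch from several seeds and keep the best result."""
    best = None
    for seed in range(n_init):
        centroids, labels = kmeans(X, k, seed=seed)
        # Within-cluster sum of squared distances (lower is better).
        inertia = sum(((X[labels == j] - centroids[j]) ** 2).sum() for j in range(k))
        if best is None or inertia < best[0]:
            best = (inertia, centroids, labels)
    return best[1], best[2]
```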