We try to separate a data set into several subsets of the population. One of the problems with K-Means and other clustering algorithms is picking the number of clusters: you need to pre-define $K$. In most algorithms, you either specify the number of clusters directly or set some parameter, like a threshold, that dictates how many clusters you end up with. We would like the algorithm to pick $K$ automatically, but so far there is no good way to do this universally.
The root of this problem is that the number of clusters is an ambiguous thing. When you are dealing with real-world data, the data has multiple scales, so the question is really one about the granularity of the data.
So the idea of hierarchical clustering is: instead of picking the number of clusters, build a hierarchy.
Top levels capture coarse effects; low levels capture fine-grained structure.
Top-down (divisive): start with all items in one cluster, then split recursively (a minimal sketch follows the advantage/disadvantage notes below).
We know how to split the data into a fixed, small number of clusters (e.g., $K = 2$), so we apply that split over and over.
Advantage: Fast
Disadvantage: Greedy, can't cross boundaries. A split made early can never be undone, so nearby points may end up in different clusters.
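A minimal sketch of this top-down idea, assuming scikit-learn's KMeans as the 2-way splitter; the function name `divisive_split` and the `max_depth` stopping rule are illustrative choices, not part of any standard API.

```python
import numpy as np
from sklearn.cluster import KMeans

def divisive_split(X, idx=None, depth=0, max_depth=3):
    """Recursively 2-split X; returns a nested tuple of index arrays."""
    if idx is None:
        idx = np.arange(len(X))    # start: all items in one cluster
    if depth >= max_depth or len(idx) < 2:
        return idx                 # leaf: stop splitting
    # greedy binary split -- a boundary drawn here can never be crossed later
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(X[idx])
    return (divisive_split(X, idx[labels == 0], depth + 1, max_depth),
            divisive_split(X, idx[labels == 1], depth + 1, max_depth))
```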
Bottom-up (agglomerative): start with singletons and merge by some criterion.
Idea: ensure nearby points end up in the same cluster
Start with a collection $C$ of $n$ singleton clusters.
Repeat until only one cluster remains: find the two closest clusters in $C$, merge them, and replace the pair in $C$ with the merged cluster.
This is not the distance between two instances. This is the distance between two clusters.
It produces a dendrogram: a hierarchical tree of clusters.
In this algorithm, you need to define a distance metric over clusters.
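A minimal sketch of the agglomerative loop, using single-link distance (defined below) as the pluggable cluster metric; this is the naive version that recomputes all pairwise cluster distances each round, so it is slow but easy to read.

```python
import numpy as np

def single_link(A, B, X):
    # cluster distance = distance between the closest pair of elements
    return min(np.linalg.norm(X[a] - X[b]) for a in A for b in B)

def agglomerate(X, cluster_dist=single_link):
    C = [[i] for i in range(len(X))]   # n singleton clusters
    merges = []                        # the merge order encodes the dendrogram
    while len(C) > 1:
        # find the closest pair of clusters under the chosen metric
        i, j = min(((a, b) for a in range(len(C)) for b in range(a + 1, len(C))),
                   key=lambda p: cluster_dist(C[p[0]], C[p[1]], X))
        merges.append((C[i], C[j]))
        C = [c for k, c in enumerate(C) if k not in (i, j)] + [C[i] + C[j]]
    return merges
```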
Disadvantages: Slow, much slower than K-Means (the naive implementation is $O(n^3)$).
Advantages: Once you have a dendrogram, you can cut the data at any granularity. If you want a flat clustering, pick a threshold on the distance and cut the tree there.
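In practice you can do this cut with SciPy's hierarchy tools; the threshold 0.2 below is just an illustrative value for this toy data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(30, 2)           # toy data
Z = linkage(X, method='single')     # build the dendrogram (single link here)
# cut the tree: merges above distance 0.2 are severed,
# turning the hierarchy into a flat clustering
flat = fcluster(Z, t=0.2, criterion='distance')
```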
Single Link: the distance between two clusters is the distance between their closest elements, $d(A, B) = \min_{a \in A,\, b \in B} d(a, b)$.
Advantages: Simple; works well in many situations.
Disadvantages: Produces long chains. Eventually, two points that are far apart may be put into the same cluster.
Complete Link: the distance between two clusters is the distance between their farthest elements, $d(A, B) = \max_{a \in A,\, b \in B} d(a, b)$.
But once the cluster distances are computed, the algorithm still merges the closest pair of clusters.
These two measures can produce different results. In the example, with Single Link, red and yellow combine; with Complete Link, red and yellow do not combine, and yellow combines with blue instead.
Average Link: the average of all pairwise distances between elements of the two clusters. Less affected by outliers.
Centroid: the distance between the centroids (means) of the two clusters.
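A side-by-side sketch of the four cluster-distance measures above, with `A` and `B` given as NumPy arrays of the points in each cluster:

```python
import numpy as np
from scipy.spatial.distance import cdist

def single_link(A, B):
    return cdist(A, B).min()     # closest pair of elements

def complete_link(A, B):
    return cdist(A, B).max()     # farthest pair of elements

def average_link(A, B):
    return cdist(A, B).mean()    # average over all pairs; less outlier-sensitive

def centroid_dist(A, B):
    return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))  # distance between means
```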