We assume the data is IID, which means independent and identically distributed: each point is drawn independently from the same underlying distribution.
Recall the types of clustering methods.
The goal is to automatically discover all parameters for the K “sources” (K Gaussian distributions).
In the first graph, if we already know which points come from which distribution, then it is easy to figure out the parameters of each Gaussian distribution.
But if we don’t know which points come from which distribution (figure 2), it becomes tricky.
However, if we know the parameters of the two distributions, we can figure out the assignments by calculating the probability of each data point under each distribution.
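As a concrete sketch (the parameter values and variable names below are made up for illustration, not taken from the figures), this assignment is just Bayes’ rule with the two Gaussian densities, assuming equal priors for the two distributions:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian with mean mu and standard deviation sigma."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Hypothetical known parameters for the two distributions ("blue" and "yellow").
mu_b, sigma_b = 0.0, 1.0
mu_y, sigma_y = 4.0, 1.5

x = np.array([-0.3, 0.8, 3.5, 5.1])   # some data points

# Bayes' rule with equal priors: probability each point came from "blue".
p_b = gaussian_pdf(x, mu_b, sigma_b)
p_y = gaussian_pdf(x, mu_y, sigma_y)
b = p_b / (p_b + p_y)
print(b)   # close to 1 for points near mu_b, close to 0 for points near mu_y
```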
Therefore, we have a chicken and egg problem.
If somebody told you the mean and variance, you could figure out which distribution the data came from. On the other hand, if you know which data point came from which distribution, you could calculate the mean and variance.
So you need one to get the other.
That’s basically what the EM algorithm does for you.
EM algorithm
At first, we assign random parameters (means and variances) to the two Gaussian distributions.
Then we compute, for every point, the probability that it came from each distribution, and use these probabilities as soft assignments for the points.
Once we have done the color assignment, we can re-compute the mean and variance for both distributions:
$\mu_b = \frac{b_1x_1 + b_2x_2 + \dots + b_nx_n}{b_1 + b_2 + \dots + b_n}$
$\sigma_b^2 = \frac{b_1(x_1 - \mu_b)^2 + \dots + b_n(x_n - \mu_b)^2}{b_1 + b_2 + \dots + b_n}$
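A minimal NumPy sketch of these two weighted updates, where `b` holds each point’s probability of belonging to the blue distribution (the function and variable names here are illustrative, not from the original notes):

```python
import numpy as np

def update_gaussian(x, b):
    """Soft (weighted) re-estimate of one distribution's mean and variance.

    x : data points, shape (n,)
    b : probability that each point belongs to this distribution, shape (n,)
    """
    mu = np.sum(b * x) / np.sum(b)               # the mu_b formula above
    var = np.sum(b * (x - mu) ** 2) / np.sum(b)  # the sigma_b^2 formula above
    return mu, var
```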
After this step, your Gaussian distributions will look like this.
And, after repeating these two steps, they will eventually look like this:
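Putting the two steps together, a minimal EM loop for two 1-D Gaussians might look like the sketch below. It assumes equal mixing weights for simplicity; a full Gaussian mixture model would also update the mixture weights each iteration.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian with mean mu and standard deviation sigma."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def em_two_gaussians(x, n_iter=100):
    """Fit two 1-D Gaussians to x with EM, assuming equal mixing weights."""
    rng = np.random.default_rng(0)
    # Random initialization: two random points as means, overall std as spread.
    mu_b, mu_y = rng.choice(x, size=2, replace=False)
    sigma_b = sigma_y = np.std(x)

    for _ in range(n_iter):
        # E-step: probability that each point belongs to the "blue" distribution.
        p_b = gaussian_pdf(x, mu_b, sigma_b)
        p_y = gaussian_pdf(x, mu_y, sigma_y)
        b = p_b / (p_b + p_y)
        y = 1.0 - b

        # M-step: weighted means and variances, as in the formulas above.
        mu_b = np.sum(b * x) / np.sum(b)
        sigma_b = np.sqrt(np.sum(b * (x - mu_b) ** 2) / np.sum(b))
        mu_y = np.sum(y * x) / np.sum(y)
        sigma_y = np.sqrt(np.sum(y * (x - mu_y) ** 2) / np.sum(y))

    return (mu_b, sigma_b), (mu_y, sigma_y)

# Example: data drawn from a mixture of two Gaussians.
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(0, 1, 200), rng.normal(4, 1.5, 200)])
print(em_two_gaussians(data))   # recovers parameters close to (0, 1) and (4, 1.5)
```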