We assume the data is IID, which means independent and identically distributed: each point is drawn independently from the same underlying distribution.
Recall the types of clustering methods.
The goal is to automatically discover all parameters for the K “sources” (K Gaussian distributions).
In the first graph, if we already know which points come from which distribution, then it is easy to figure out the parameters of each Gaussian distribution.
But if we don’t know which points come from which distribution (figure 2), it becomes tricky.
However, if we know the parameters of the two distributions, we can figure out the assignments by calculating the probability of each data point under each distribution.
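As a concrete sketch (the parameter values and variable names below are made up for illustration, not taken from the figures), this assignment is just Bayes’ rule with the two Gaussian densities, assuming equal priors for the two distributions:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian with mean mu and standard deviation sigma."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Hypothetical known parameters for the two distributions ("blue" and "yellow").
mu_b, sigma_b = 0.0, 1.0
mu_y, sigma_y = 4.0, 1.5

x = np.array([-0.3, 0.8, 3.5, 5.1])   # some data points

# Bayes' rule with equal priors: probability each point came from "blue".
p_b = gaussian_pdf(x, mu_b, sigma_b)
p_y = gaussian_pdf(x, mu_y, sigma_y)
b = p_b / (p_b + p_y)
print(b)   # close to 1 for points near mu_b, close to 0 for points near mu_y
```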
Therefore, we have a chicken and egg problem.
If somebody told you the mean and variance, you could figure out which distribution the data came from. On the other hand, if you know which data point came from which distribution, you could calculate the mean and variance.
So you need one to get the other.
That’s basically what the EM algorithm does for you.
EM algorithm
At first, we assign random parameters (means and variances) to the two Gaussian distributions.
Then we compute, for every point, the probability that it came from each distribution, and use these probabilities as soft assignments for the points.
Once we have done the color assignment, we can re-compute the mean and variance for both distributions:
$\mu_b = \frac{b_1x_1 + b_2x_2 + \dots + b_nx_n}{b_1 + b_2 + \dots + b_n}$
$\sigma_b^2 = \frac{b_1(x_1 - \mu_b)^2 + \dots + b_n(x_n - \mu_b)^2}{b_1 + b_2 + \dots + b_n}$
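A minimal NumPy sketch of these two weighted updates, where `b` holds each point’s probability of belonging to the blue distribution (the function and variable names here are illustrative, not from the original notes):

```python
import numpy as np

def update_gaussian(x, b):
    """Soft (weighted) re-estimate of one distribution's mean and variance.

    x : data points, shape (n,)
    b : probability that each point belongs to this distribution, shape (n,)
    """
    mu = np.sum(b * x) / np.sum(b)               # the mu_b formula above
    var = np.sum(b * (x - mu) ** 2) / np.sum(b)  # the sigma_b^2 formula above
    return mu, var
```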
After this step, your Gaussian distributions will look like this.
And, after repeating these two steps, they will eventually look like this:
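Putting the two steps together, a minimal EM loop for two 1-D Gaussians might look like the sketch below. It assumes equal mixing weights for simplicity; a full Gaussian mixture model would also update the mixture weights each iteration.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian with mean mu and standard deviation sigma."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def em_two_gaussians(x, n_iter=100):
    """Fit two 1-D Gaussians to x with EM, assuming equal mixing weights."""
    rng = np.random.default_rng(0)
    # Random initialization: two random points as means, overall std as spread.
    mu_b, mu_y = rng.choice(x, size=2, replace=False)
    sigma_b = sigma_y = np.std(x)

    for _ in range(n_iter):
        # E-step: probability that each point belongs to the "blue" distribution.
        p_b = gaussian_pdf(x, mu_b, sigma_b)
        p_y = gaussian_pdf(x, mu_y, sigma_y)
        b = p_b / (p_b + p_y)
        y = 1.0 - b

        # M-step: weighted means and variances, as in the formulas above.
        mu_b = np.sum(b * x) / np.sum(b)
        sigma_b = np.sqrt(np.sum(b * (x - mu_b) ** 2) / np.sum(b))
        mu_y = np.sum(y * x) / np.sum(y)
        sigma_y = np.sqrt(np.sum(y * (x - mu_y) ** 2) / np.sum(y))

    return (mu_b, sigma_b), (mu_y, sigma_y)

# Example: data drawn from a mixture of two Gaussians.
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(0, 1, 200), rng.normal(4, 1.5, 200)])
print(em_two_gaussians(data))   # recovers parameters close to (0, 1) and (4, 1.5)
```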