Probabilistic Clustering
Clusters form for one simple reason: the points are drawn from different distributions. Points that come from the same distribution tend to get clustered together.
Say we have data for men's heights and women's heights, pooled together without labels. The data points can naturally be modeled as coming from two Gaussian distributions, one per group.
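To make this concrete, here is a minimal sketch of sampling such height data in Python; the means, standard deviations, and group sizes are made-up illustrative values, not taken from any real dataset.

```python
import numpy as np

# Illustrative (made-up) parameters: men ~ N(178, 7^2) cm, women ~ N(165, 6^2) cm.
rng = np.random.default_rng(0)
men = rng.normal(loc=178, scale=7, size=500)
women = rng.normal(loc=165, scale=6, size=500)

# A clustering algorithm only sees the pooled, unlabeled sample.
heights = np.concatenate([men, women])
rng.shuffle(heights)
```

A histogram of `heights` would show the two underlying Gaussians blended into a single unlabeled sample.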
Sometimes, we do not know how many distributions the data points were drawn from. So, we could experiment with different k values.
Say we have two distributions with means $\mu_1, \mu_2$ and standard deviations $\sigma_1, \sigma_2$. Then, each data point $x$ is drawn from the mixture density
$$p(x) = P(1)\,N(x;\mu_1,\sigma_1^2) + P(2)\,N(x;\mu_2,\sigma_2^2)$$
where $P(1)$ and $P(2)$ are the prior probabilities of the two Gaussians.
We need to estimate the parameters of this mixture of Gaussians (i.e. a Gaussian Mixture Model (GMM)), assuming we have a known $\sigma$ shared by both Gaussians and a uniform prior ($P(1) = P(2) = \frac{1}{2}$).
To do so, we use an approach called Expectation Maximization (EM).
EM in the Context of GMMs
Here, we discuss EM in the context of a Mixture of 2 Gaussians, with the assumptions that we have a known $\sigma$ and a uniform prior.
Assume each data point is actually a triple of the form $\langle x, z_1, z_2 \rangle$, where $x$ is given but $z_1, z_2$ are unobservable. $z_1 = 1$ if the point was generated from $N(\mu_1, \sigma^2)$ (Gaussian 1), and 0 otherwise. Similarly for $z_2$.
Algorithm
1. Pick random initial values for $\mu_1$ and $\mu_2$.
2. E Step: for j = 1 to 2:
calculate the expected value $E[z_j]$ for each $x$, assuming that the current hypothesis $h = \langle \mu_1, \mu_2 \rangle$ is correct.
In general,
$$E[z_j] = P(z_j = 1 \mid x) = \frac{p(x \mid \mu_j)\,P(z_j = 1)}{\sum_{i=1}^{2} p(x \mid \mu_i)\,P(z_i = 1)}$$
Here, since we assume a uniform prior, $P(z_1 = 1) = P(z_2 = 1) = \frac{1}{2}$, and $p(x \mid \mu_i) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x - \mu_i)^2}{2\sigma^2}}$. The prior and the constant $\frac{1}{\sqrt{2\pi}\,\sigma}$ cancel. So,
$$E[z_j] = \frac{e^{-\frac{(x - \mu_j)^2}{2\sigma^2}}}{\sum_{i=1}^{2} e^{-\frac{(x - \mu_i)^2}{2\sigma^2}}}$$
3. M Step: for j = 1 to 2: update the means, using a weighted average over all $n$ data points:
$$\mu_j \leftarrow \frac{\sum_{m=1}^{n} E[z_{jm}]\, x_m}{\sum_{m=1}^{n} E[z_{jm}]}$$
where $E[z_{jm}]$ is the expected membership of point $x_m$ in Gaussian $j$ from the E step.
4. Repeat steps 2 and 3 until convergence (see the code sketch below).
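To illustrate the whole loop, here is a minimal NumPy sketch of EM for a mixture of two 1-D Gaussians with a known, shared $\sigma$ and a uniform prior, estimating only the means. The function name, stopping rule, and synthetic test data are our own illustrative choices, not part of the original algorithm description.

```python
import numpy as np

def em_two_gaussians(x, sigma, n_iter=100, tol=1e-6, seed=0):
    """Sketch of EM for a 2-component Gaussian mixture with a known, shared
    sigma and a uniform prior on the components: only the means are estimated."""
    rng = np.random.default_rng(seed)
    # Step 1: pick random initial values for mu_1, mu_2 (two random data points).
    mu = rng.choice(x, size=2, replace=False).astype(float)

    for _ in range(n_iter):
        # E step: E[z_j] for each point under the current hypothesis.
        # With a uniform prior the 1/2 terms cancel, and so does the
        # 1/(sqrt(2*pi)*sigma) constant, leaving normalized Gaussian likelihoods.
        lik = np.exp(-((x[:, None] - mu[None, :]) ** 2) / (2 * sigma ** 2))
        e_z = lik / lik.sum(axis=1, keepdims=True)   # shape (n_points, 2)

        # M step: update each mean as a weighted average of the data,
        # weighted by the expected memberships E[z_j].
        new_mu = (e_z * x[:, None]).sum(axis=0) / e_z.sum(axis=0)

        if np.max(np.abs(new_mu - mu)) < tol:        # simple convergence check
            mu = new_mu
            break
        mu = new_mu
    return mu, e_z

# Usage on synthetic data drawn from N(165, 6^2) and N(178, 6^2):
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(165, 6, 500), rng.normal(178, 6, 500)])
means, responsibilities = em_two_gaussians(data, sigma=6.0)
print(means)   # should land near 165 and 178 (order may vary)
```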
EM may not converge to a global optimum, but will at least converge to a local optimum, in a reasonable amount of time.
Note: This is soft clustering. If we wanted to perform hard clustering, we would set $E[z_1]$ and $E[z_2]$ to 0 or 1 (whichever is larger becomes 1, the other 0) for each example in each iteration.
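The note above describes snapping the memberships inside every iteration; the same argmax idea, applied once to the final responsibilities from the usage example above (it relies on the `responsibilities` array returned there), gives a hard assignment as a simple illustration:

```python
import numpy as np

# Hard clustering: each point goes to whichever component has the larger E[z_j].
hard_labels = responsibilities.argmax(axis=1)   # 0 or 1 for each point
hard_e_z = np.eye(2)[hard_labels]               # memberships snapped to [1, 0] or [0, 1]
print(np.bincount(hard_labels))                 # number of points in each cluster
```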
Why Use K Means instead of EM?
K Means does not make any underlying assumptions about the distributions from which the data was drawn. So, if we aren't convinced that our data comes from Gaussians, we may want to steer clear of EM.
Also, EM gives a soft clustering by default, while K Means gives a hard clustering.