Probabilistic Clustering

Clusters form for one simple reason: the points are drawn from different distributions. Points that come from the same distribution tend to be clustered together.

Say we have data for men's heights and women's heights. The data points will clearly be from two Gaussian distributions.

Sometimes, we do not know how many distributions the data points were drawn from. So, we could experiment with different k values.
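As a quick illustration of the heights example, here is a minimal sketch (the means and standard deviations are assumed values, not from these notes) that draws points from two Gaussians and pools them into one unlabeled dataset:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (in cm); these numbers are assumptions for the example
heights_men = rng.normal(loc=178, scale=7, size=500)
heights_women = rng.normal(loc=165, scale=6, size=500)

# Pool the samples; the labels (which Gaussian produced each point) are hidden
X = np.concatenate([heights_men, heights_women])
rng.shuffle(X)
print(X[:5])
```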

Say we have two distributions with means $\mu_1,\mu_2$ and standard deviations $\sigma_1,\sigma_2$. Then,

$$p(X|\mu_1,\mu_2,\sigma_1,\sigma_2)=\prod_{x\in X}\left[\frac{1}{2}\left(\frac{1}{\sqrt{2\pi}\sigma_1}e^{-\frac{(x-\mu_1)^2}{2\sigma_1^2}}\right) + \frac{1}{2}\left(\frac{1}{\sqrt{2\pi}\sigma_2}e^{-\frac{(x-\mu_2)^2}{2\sigma_2^2}}\right)\right]$$
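In practice the product over many points underflows, so one usually evaluates the log-likelihood instead. A minimal NumPy sketch of this mixture likelihood (the data and parameter values are made up for illustration):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def mixture_log_likelihood(X, mu1, mu2, sigma1, sigma2):
    """log p(X | mu1, mu2, sigma1, sigma2) for the equal-weight two-Gaussian mixture above."""
    per_point = 0.5 * gaussian_pdf(X, mu1, sigma1) + 0.5 * gaussian_pdf(X, mu2, sigma2)
    return np.sum(np.log(per_point))

# Example call with assumed values
X = np.array([160.0, 162.5, 175.0, 181.3])
print(mixture_log_likelihood(X, mu1=165, mu2=178, sigma1=6, sigma2=7))
```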

We need to estimate the parameters $\mu_1,\mu_2$ for the mixture of Gaussians (i.e. a Gaussian Mixture Model (GMM)), assuming we have a known $\sigma$ ($\sigma_1=\sigma_2=\sigma$) and a uniform prior.

$$\arg\max_{\mu_1,\mu_2}\, p(X|\mu_1,\mu_2)$$
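Equivalently, since the log is monotonic, we can maximize the log-likelihood. Note that the log of a sum does not simplify, which is why there is no closed-form solution and we resort to an iterative procedure:

$$\arg\max_{\mu_1,\mu_2}\, p(X|\mu_1,\mu_2) = \arg\max_{\mu_1,\mu_2}\, \sum_{x\in X}\log\left[\frac{1}{2}\,\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu_1)^2}{2\sigma^2}} + \frac{1}{2}\,\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu_2)^2}{2\sigma^2}}\right]$$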

To do so, we use an approach called Expectation Maximization (EM).

EM in the Context of GMMs

Here, we discuss EM in the context of a mixture of 2 Gaussians, with the assumptions that we have a known $\sigma$ ($\sigma_1=\sigma_2=\sigma$) and a uniform prior.

Assume each data point is actually a triple of the form $(x,z_1,z_2)$, where $x$ is given but $z_1, z_2$ are unobservable. $z_1=1$ if the point was generated from $G_1$ (Gaussian 1), and 0 otherwise. Similarly for $z_2$.

Algorithm

  1. Pick random initial values for $h=(\mu_1,\mu_2)$

  2. E Step:

    for j=1 to 2:

    calculate the expected value $E[z_j]$ for each $x$, assuming that the current hypothesis is correct

    $E[z_1]=1\cdot p(x \text{ was generated from dist. } 1)+0\cdot p(x \text{ was generated from dist. } 2)$

    $= p(x \text{ was generated from dist. } 1) = P(G_1|x)$

    in general, $E[z_j]= P(G_j|x)=\frac{p(x|G_j)P(G_j)}{p(x)}$

    here, $E[z_1]=\frac{p(x|G_1)\cdot\frac{1}{2}}{p(x|G_1)\cdot\frac{1}{2}+p(x|G_2)\cdot\frac{1}{2}}$ (since we assume a uniform prior, $P(G_1)=P(G_2)=\frac{1}{2}$) $= \frac{p(x|G_1)}{p(x|G_1)+p(x|G_2)}$

    and $E[z_2]= \frac{p(x|G_2)}{p(x|G_1)+p(x|G_2)}$. So, $E[z_1]+E[z_2]=1$.

  3. M Step: for $j=1$ to 2: update the means, using a weighted average: $\mu_j = \frac{\sum_{x\in X} x\cdot P(G_j|x)}{\sum_{x\in X} P(G_j|x)}$

  4. Repeat 2 and 3 until convergence.
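A minimal NumPy sketch of this loop, under the notes' assumptions (two Gaussians, equal weights, known shared $\sigma$); the initialization strategy and stopping tolerance are illustrative choices, not part of the algorithm itself:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def em_two_gaussians(X, sigma, n_iter=100, tol=1e-6, seed=0):
    """EM for an equal-weight 2-Gaussian mixture with known shared sigma."""
    rng = np.random.default_rng(seed)
    # 1. Pick random initial values for h = (mu1, mu2)
    mu1, mu2 = rng.choice(X, size=2, replace=False)
    for _ in range(n_iter):
        # 2. E step: responsibilities E[z_j] = P(G_j | x) for every point
        p1 = gaussian_pdf(X, mu1, sigma)
        p2 = gaussian_pdf(X, mu2, sigma)
        r1 = p1 / (p1 + p2)   # E[z_1]
        r2 = 1.0 - r1         # E[z_2]
        # 3. M step: update each mean as a weighted average of the data
        new_mu1 = np.sum(r1 * X) / np.sum(r1)
        new_mu2 = np.sum(r2 * X) / np.sum(r2)
        # 4. Repeat until the means stop changing
        if abs(new_mu1 - mu1) < tol and abs(new_mu2 - mu2) < tol:
            mu1, mu2 = new_mu1, new_mu2
            break
        mu1, mu2 = new_mu1, new_mu2
    return mu1, mu2

# Example usage on synthetic height-like data (assumed values)
rng = np.random.default_rng(42)
X = np.concatenate([rng.normal(165, 6, 300), rng.normal(178, 6, 300)])
print(em_two_gaussians(X, sigma=6.0))
```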

EM may not converge to a global optimum, but it will at least converge to a local optimum in a reasonable amount of time.

Note: This is soft clustering. If we wanted to perform hard clustering, we would set $z_1$ or $z_2$ to 1 (and the other to 0) for each example in each iteration, depending on which expected value is larger.
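A small sketch of that hard-assignment step, given responsibilities like the `r1`, `r2` arrays computed in the E step above (the values here are made up):

```python
import numpy as np

# Responsibilities from an E step (illustrative values)
r1 = np.array([0.9, 0.2, 0.7, 0.4])
r2 = 1.0 - r1

# Hard assignment: each point goes entirely to the more probable Gaussian
z1 = (r1 > r2).astype(int)   # 1 if P(G_1|x) > P(G_2|x), else 0
z2 = 1 - z1
print(z1, z2)
```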

Why Use K Means instead of EM?

  • K-Means does not make any underlying assumptions about the distributions from which the data was drawn. So, if we aren't convinced that our data comes from Gaussians, we may want to steer clear of EM.

  • Also, EM gives a soft clustering by default, while K-Means gives a hard clustering.
