Probabilistic Clustering
Clusters form for one simple reason: the points are drawn from different distributions. Points that come from the same distribution tend to get clustered together.
Say we have data for men's heights and women's heights. The data points will come from two (approximately) Gaussian distributions, one for each group.
Sometimes, we do not know how many distributions the data points were drawn from. So, we could experiment with different k values.
Say we have two Gaussian distributions with means $\mu_1, \mu_2$ and standard deviations $\sigma_1, \sigma_2$. Then each data point $x$ is drawn from the mixture density
$$p(x) = w_1\, \mathcal{N}(x \mid \mu_1, \sigma_1^2) + w_2\, \mathcal{N}(x \mid \mu_2, \sigma_2^2),$$
where $w_1 + w_2 = 1$ are the mixing weights (with a uniform prior, $w_1 = w_2 = \tfrac{1}{2}$).
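To make this concrete, here is a small hypothetical sketch that samples such a mixture with NumPy; the group means, standard deviations, and sample size below are made-up illustrative values, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters for the two height distributions (cm).
mu = np.array([178.0, 165.0])      # assumed means for the two groups
sigma = np.array([7.0, 6.0])       # assumed standard deviations
n = 1000

# Uniform mixing weights: each point is equally likely to come from either Gaussian.
z = rng.integers(0, 2, size=n)     # hidden label: which Gaussian generated the point
x = rng.normal(mu[z], sigma[z])    # observed data: only x is visible to us
```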
We need to estimate the parameters of this mixture of Gaussians (i.e. a Gaussian Mixture Model (GMM)), assuming a known variance $\sigma^2$ and a uniform prior.
To do so, we use an approach called Expectation Maximization (EM).
Here, we discuss EM in the context of a mixture of 2 Gaussians, with the assumptions that we have a known (shared) variance $\sigma^2$ and a uniform prior.
Assume each data point is actually a triple of the form $\langle x_i, z_{i1}, z_{i2} \rangle$, where $x_i$ is given but $z_{i1}, z_{i2}$ are unobservable. $z_{i1} = 1$ if the point was generated from $\mathcal{N}(\mu_1, \sigma^2)$ (Gaussian 1), and 0 otherwise. Similarly for $z_{i2}$.
Pick random initial values for the means $\mu_1$ and $\mu_2$.
E Step:
for j=1 to 2:
calculate the expected value $E[z_{ij}]$ for each $x_i$, assuming that the current hypothesis $\langle \mu_1, \mu_2 \rangle$ is correct
in general, $E[z_{ij}] = P(z_{ij} = 1 \mid x_i) = \dfrac{p(x_i \mid \mu_j)\, P(\mu_j)}{\sum_{n=1}^{2} p(x_i \mid \mu_n)\, P(\mu_n)}$
here, since we assume a uniform prior, $P(\mu_1) = P(\mu_2) = \tfrac{1}{2}$, so $E[z_{ij}] = \dfrac{p(x_i \mid \mu_j)}{\sum_{n=1}^{2} p(x_i \mid \mu_n)}$
and $p(x_i \mid \mu_j) = \dfrac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x_i - \mu_j)^2}{2\sigma^2}}$. So, $E[z_{ij}] = \dfrac{e^{-\frac{(x_i - \mu_j)^2}{2\sigma^2}}}{\sum_{n=1}^{2} e^{-\frac{(x_i - \mu_n)^2}{2\sigma^2}}}$.
M Step: for j = 1 to 2: update the means, using a weighted average:
$$\mu_j \leftarrow \frac{\sum_{i=1}^{m} E[z_{ij}]\, x_i}{\sum_{i=1}^{m} E[z_{ij}]}$$
Repeat the E and M steps until convergence.
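Below is a minimal Python sketch of this procedure, assuming a mixture of two Gaussians with a known, shared variance $\sigma^2$ and a uniform prior; the function name and parameters are illustrative, not from the source:

```python
import numpy as np

def em_two_gaussians(x, sigma2, n_iters=100, seed=0):
    """EM for a mixture of 2 Gaussians with known shared variance and uniform prior.

    Only the means mu_1, mu_2 are estimated. Returns the estimated means and the
    responsibilities E[z_ij] from the final E step.
    """
    rng = np.random.default_rng(seed)
    # Pick random initial values for mu_1, mu_2 (two distinct data points).
    mu = rng.choice(x, size=2, replace=False).astype(float)

    for _ in range(n_iters):
        # E step: E[z_ij] under the current means. With a uniform prior and a shared
        # variance, the prior and the 1/sqrt(2*pi*sigma^2) factors cancel out.
        log_p = -((x[:, None] - mu[None, :]) ** 2) / (2.0 * sigma2)   # shape (m, 2)
        log_p -= log_p.max(axis=1, keepdims=True)                     # numerical stability
        p = np.exp(log_p)
        resp = p / p.sum(axis=1, keepdims=True)                       # E[z_ij]

        # M step: update each mean as a weighted average of the data.
        mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)

    return mu, resp
```

Using the heights sample from the earlier sketch, this could be called as `mu_hat, resp = em_two_gaussians(x, sigma2=7.0 ** 2)`; each row of `resp` holds the soft assignment of one point to the two Gaussians.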
EM may not converge to the global optimum, but it will at least converge to a local optimum in a reasonable amount of time.
It does make assumptions about the underlying distributions from which the data was drawn. So, if we aren't convinced that our data comes from Gaussians, we may want to steer clear of a Gaussian mixture model.
Also, EM gives a soft clustering by default, while k-means gives a hard clustering.
Note: This is soft clustering. If we wanted to perform hard clustering, we would set $E[z_{i1}]$ and $E[z_{i2}]$ to 0 or 1, depending on which was bigger, for each example in each iteration.
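As a small illustration of that hard-assignment step (the responsibility values below are made up):

```python
import numpy as np

# Example soft assignments E[z_ij] (illustrative values).
resp = np.array([[0.9, 0.1],
                 [0.3, 0.7]])

# Hard clustering: set the larger responsibility in each row to 1 and the other to 0.
hard = np.zeros_like(resp)
hard[np.arange(len(resp)), resp.argmax(axis=1)] = 1.0
# hard == [[1., 0.], [0., 1.]]
```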