CS-GY 6923: Machine Learning

Probabilistic Clustering

Clusters form for one simple reason: the points are drawn from different distributions, and points that come from the same distribution tend to get clustered together.

Say we have data on men's heights and women's heights. The data points will clearly come from two Gaussian distributions, one for each group.

Sometimes, we do not know how many distributions the data points were drawn from, so we can experiment with different values of $k$.

Say we have two distributions with means $\mu_1,\mu_2$ and standard deviations $\sigma_1,\sigma_2$. Then,

$$p(X|\mu_1,\mu_2,\sigma_1,\sigma_2)=\prod_{x\in X}\left[\frac{1}{2}\left(\frac{1}{\sqrt{2\pi}\sigma_1}e^{-\frac{(x-\mu_1)^2}{2\sigma_1^2}}\right) + \frac{1}{2}\left(\frac{1}{\sqrt{2\pi}\sigma_2}e^{-\frac{(x-\mu_2)^2}{2\sigma_2^2}}\right)\right]$$
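As a quick illustration, the density above can be evaluated directly. The following is a minimal NumPy sketch (the function names are just illustrative, not from the notes); it computes the log of the product rather than the product itself, since multiplying many small densities underflows numerically.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    # Univariate Gaussian density N(x; mu, sigma^2).
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def mixture_log_likelihood(X, mu1, mu2, sigma1, sigma2):
    # log p(X | mu1, mu2, sigma1, sigma2) for the equal-weight two-Gaussian mixture above.
    per_point = 0.5 * gaussian_pdf(X, mu1, sigma1) + 0.5 * gaussian_pdf(X, mu2, sigma2)
    return np.sum(np.log(per_point))
```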

We need to estimate the parameters $\mu_1,\mu_2$ for this mixture of Gaussians (i.e., a Gaussian Mixture Model (GMM)), assuming we have a known $\sigma$ ($\sigma_1=\sigma_2=\sigma$) and a uniform prior:

$$\arg\max_{\mu_1,\mu_2}\; p(X|\mu_1,\mu_2)$$

To do so, we use an approach called Expectation Maximization (EM).

EM in the Context of GMMs

Here, we discuss EM in the context of a mixture of 2 Gaussians, with the assumptions that we have a known $\sigma$ ($\sigma_1=\sigma_2=\sigma$) and a uniform prior.

Assume each data point is actually a triple of the form $(x,z_1,z_2)$, where $x$ is given but $z_1, z_2$ are unobservable. $z_1=1$ if the point was generated from $G_1$ (Gaussian 1), and 0 otherwise. Similarly for $z_2$.

Algorithm

  1. Pick random initial values for $h=(\mu_1,\mu_2)$.

  2. E Step:

    for $j = 1$ to 2:

    calculate the expected value $E[z_j]$ for each $x$, assuming that the current hypothesis is correct.

    $E[z_1]=1\cdot p(x \text{ was generated from dist. } 1)+0\cdot p(x \text{ was generated from dist. } 2)$

    $= p(x \text{ was generated from dist. } 1) = P(G_1|x)$

    In general, $E[z_j]= P(G_j|x)=\frac{p(x|G_j)P(G_j)}{p(x)}$

    Here, $E[z_1]=\frac{p(x|G_1)\cdot\frac{1}{2}}{p(x|G_1)\cdot\frac{1}{2}+p(x|G_2)\cdot\frac{1}{2}}$ (since we assume a uniform prior, $P(G_1)=P(G_2)=\frac{1}{2}$) $= \frac{p(x|G_1)}{p(x|G_1)+p(x|G_2)}$,

    and $E[z_2]= \frac{p(x|G_2)}{p(x|G_1)+p(x|G_2)}$. So, $E[z_1]+E[z_2]=1$.

  3. M Step: for $j = 1$ to 2: update the means using a weighted average: $\mu_j = \frac{\sum_{x\in X} x\cdot P(G_j|x)}{\sum_{x\in X} P(G_j|x)}$

  4. Repeat steps 2 and 3 until convergence (see the sketch below).
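Below is a minimal NumPy sketch of steps 1–4 under the stated assumptions (one-dimensional data, known shared $\sigma$, uniform prior). The function name and the fixed iteration count are illustrative choices, not part of the algorithm; a real implementation would test for convergence of the means instead.

```python
import numpy as np

def em_two_gaussians(X, sigma, n_iters=100, seed=0):
    # EM for a two-Gaussian mixture with known, shared sigma and a uniform prior.
    # Only the means mu_1, mu_2 are estimated; X is a 1-D NumPy array of points.
    rng = np.random.default_rng(seed)
    # Step 1: pick random initial values for h = (mu_1, mu_2).
    mu = rng.choice(X, size=2, replace=False).astype(float)

    def gaussian_pdf(x, m):
        # Univariate Gaussian density N(x; m, sigma^2).
        return np.exp(-(x - m) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

    for _ in range(n_iters):
        # E step: E[z_j] = p(x|G_j) / (p(x|G_1) + p(x|G_2)),
        # since the uniform prior cancels out of the ratio.
        p1, p2 = gaussian_pdf(X, mu[0]), gaussian_pdf(X, mu[1])
        e_z1 = p1 / (p1 + p2)
        e_z2 = 1.0 - e_z1

        # M step: update each mean as a weighted average of the data.
        mu = np.array([np.sum(e_z1 * X) / np.sum(e_z1),
                       np.sum(e_z2 * X) / np.sum(e_z2)])
    return mu
```

For instance, with the heights example, `em_two_gaussians(heights, sigma=3.0)` would return estimates of the two group means (the value of `sigma` here is just a placeholder).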

EM may not converge to a global optimum, but it will converge to a local optimum in a reasonable amount of time.

Note: This is soft clustering. If we wanted to perform hard clustering, we would, for each example in each iteration, set $z_1$ and $z_2$ to 0 or 1, giving 1 to whichever has the larger expected value.
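Continuing the sketch above (purely illustrative), a hard assignment would just threshold the responsibilities each iteration:

```python
import numpy as np

def hard_assignments(e_z1, e_z2):
    # Hard clustering: each point goes to the Gaussian with the larger responsibility
    # (label 0 for G_1, label 1 for G_2).
    return np.where(e_z1 >= e_z2, 0, 1)
```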

Why Use K-Means instead of EM?

  • K-Means does not make any underlying assumptions about the distributions from which the data was drawn. So, if we aren't convinced that our data comes from Gaussians, we may want to steer clear of EM.

  • Also, EM gives a soft clustering by default, while K-Means gives a hard clustering.
