CS-GY 6923: Machine Learning
Bayesian Approach to Parameter Estimation

Treat $\Theta$ as a random variable with a prior $p(\Theta)$.

According to Bayes' rule, $p(\Theta|X) = \frac{p(\Theta)\,p(X|\Theta)}{p(X)}$

  • The ML estimate is given by:

    $\Theta_{ML} = \operatorname{argmax}_\Theta \, p(X|\Theta)$

  • The MAP estimate is given by:

    $\Theta_{MAP} = \operatorname{argmax}_\Theta \, p(X|\Theta)\,p(\Theta)$

  • The Bayes Estimate is given by:

    $\Theta_{Bayes} = E[\Theta|X] = \int_\Theta \Theta \, p(\Theta|X)\, d\Theta$ (the integral becomes a summation for discrete values; see the sketch below)
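As a quick illustration (not part of the original notes), all three estimates can be computed side by side when $\Theta$ ranges over a finite grid of candidate values. This is a minimal sketch; `bayes_estimates` is a hypothetical helper name:

```python
import numpy as np

def bayes_estimates(thetas, prior, likelihood):
    """ML, MAP, and Bayes estimates over a discrete parameter grid.

    thetas[i] is a candidate value, prior[i] = P(Theta = thetas[i]),
    and likelihood[i] = p(X | Theta = thetas[i]).
    """
    # Bayes' rule: posterior is prior * likelihood, normalized by p(X)
    posterior = prior * likelihood / np.sum(prior * likelihood)
    theta_ml = thetas[np.argmax(likelihood)]    # argmax_Theta p(X | Theta)
    theta_map = thetas[np.argmax(posterior)]    # argmax_Theta p(Theta | X)
    theta_bayes = np.sum(thetas * posterior)    # E[Theta | X]
    return theta_ml, theta_map, theta_bayes
```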

Example with a Discrete Prior on $\Theta$

Consider a parameterized distribution: uniform on $[0,\Theta]$.

Say the discrete prior on $\Theta$ is given by:

$P(\Theta=1) = 2/3$

$P(\Theta=2) = 1/3$

Suppose $X=\{0.5,\, 1.3,\, 0.7\}$. Since the sample point $1.3$ is impossible under a uniform distribution on $[0,1]$, we know that $P(\Theta=1|X) = 0$ and therefore $P(\Theta=2|X)=1$. So the ML, MAP, and Bayes hypotheses are all 2.

Now suppose $X=\{0.5,\, 0.7,\, 0.1\}$. Then:

$p(X|\Theta=1) = 1^3 = 1$

$p(X|\Theta=2) = (1/2)^3 = 1/8$

So, $p(X) = P(\Theta=1)\,p(X|\Theta=1) + P(\Theta=2)\,p(X|\Theta=2) = (2/3)(1) + (1/3)(1/8) = 51/72$

Therefore, $P(\Theta=1|X) = \frac{p(X|\Theta=1)\,P(\Theta=1)}{p(X)} = 48/51$ and $P(\Theta=2|X) = 3/51$

In this case, both the ML and MAP hypotheses are 1. The Bayes estimate is $E[\Theta|X] = 1\cdot(48/51) + 2\cdot(3/51) = 54/51 \approx 1.06$

The posterior predictive density of a new point $x$ given $X$ is:

$p(x=0.82|X) = P(\Theta=1|X)\,p(x=0.82|\Theta=1) + P(\Theta=2|X)\,p(x=0.82|\Theta=2)$

$= (48/51)\cdot 1 + (3/51)\cdot(1/2) = 99/102$
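The arithmetic in this example is easy to verify with exact fractions; this quick check is not part of the original notes (Python's `Fraction` prints reduced forms, so $48/51$ appears as $16/17$, and so on):

```python
from fractions import Fraction

# X = {0.5, 0.7, 0.1}, model: uniform on [0, Theta]
lik1 = Fraction(1, 1) ** 3                  # p(X | Theta=1) = 1
lik2 = Fraction(1, 2) ** 3                  # p(X | Theta=2) = 1/8
prior1, prior2 = Fraction(2, 3), Fraction(1, 3)

evidence = prior1 * lik1 + prior2 * lik2    # p(X) = 51/72
post1 = prior1 * lik1 / evidence            # P(Theta=1 | X) = 48/51
post2 = prior2 * lik2 / evidence            # P(Theta=2 | X) = 3/51
bayes = 1 * post1 + 2 * post2               # E[Theta | X] = 54/51
pred = post1 * 1 + post2 * Fraction(1, 2)   # p(x=0.82 | X) = 99/102

print(evidence, post1, post2, bayes, pred)  # 17/24 16/17 1/17 18/17 33/34
```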

Example with a Continuous Prior on $\Theta$

Assume the data $X$ is drawn from a Gaussian with a known variance $\sigma^2$ and an unknown mean $\mu$ (the mean now plays the role of $\Theta$).

Assume a Gaussian prior on $\Theta$, i.e., $\Theta \sim N(\mu_0, \sigma_0^2)$, where $\mu_0$ and $\sigma_0^2$ are known.

Then $X$ is generated from $N(\Theta, \sigma^2)$ (this $\Theta$, the mean of the Gaussian from which $X$ was drawn, is what we need to estimate).

Given $X$, we have:

$\Theta_{ML} = m = \frac{\sum_t x^t}{N}$ (i.e., the sample mean)

$\Theta_{MAP} = \frac{N/\sigma^2}{N/\sigma^2 + 1/\sigma_0^2}\,m + \frac{1/\sigma_0^2}{N/\sigma^2 + 1/\sigma_0^2}\,\mu_0$

$\Theta_{Bayes} = \Theta_{MAP}$! (The posterior here is also Gaussian, so its mean coincides with its mode.)

As $N\rightarrow\infty$, $m$ dominates the weighted sum of $m$ and $\mu_0$.
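To see this numerically, here is a small simulation sketch (the prior and noise parameters below are illustrative choices, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

mu0, sigma0_sq = 0.0, 1.0   # known prior: Theta ~ N(mu0, sigma0^2)
sigma_sq = 4.0              # known data variance
theta_true = 3.0            # the unknown mean we want to estimate

for N in (5, 50, 5000):
    X = rng.normal(theta_true, np.sqrt(sigma_sq), size=N)
    m = X.mean()                                         # Theta_ML
    w = (N / sigma_sq) / (N / sigma_sq + 1 / sigma0_sq)  # weight on m
    theta_map = w * m + (1 - w) * mu0                    # = Theta_Bayes
    print(f"N={N:5d}  m={m:.3f}  MAP={theta_map:.3f}")   # MAP -> m as N grows
```

For small $N$ the MAP estimate is pulled toward the prior mean $\mu_0$; as $N$ grows, the weight $w$ approaches 1 and the estimate converges to the sample mean $m$.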

Estimates of Mean and Variance of a Distribution (not just Gaussian)

The ML estimate of the mean is $m$, i.e., the sample mean.

The ML estimate of the variance is $\frac{\sum_t (x^t-m)^2}{N}$. This estimate is biased, since $E[\hat\sigma^2] < \sigma^2$.

Note that the estimate is systematically lower than the actual value because it is computed around the sample mean $m$, which is itself fit to the data, rather than around the true mean.

However, $\frac{\sum_t (x^t-m)^2}{N-1}$ is an unbiased estimate.
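The bias is easy to demonstrate empirically; the sketch below (with arbitrary choices $\sigma^2 = 4$ and $N = 10$) averages both estimators over many repeated samples:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_sq, N, trials = 4.0, 10, 100_000

# Draw many independent samples of size N and average each estimator
samples = rng.normal(0.0, np.sqrt(sigma_sq), size=(trials, N))
biased = samples.var(axis=1, ddof=0).mean()    # divide by N
unbiased = samples.var(axis=1, ddof=1).mean()  # divide by N-1

print(biased)    # ~ sigma^2 * (N-1)/N = 3.6 (systematically low)
print(unbiased)  # ~ 4.0
```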
