CS-GY 6923: Machine Learning
1.0.0
  • Introduction
  • What is Machine Learning?
  • Types of Machine Learning
    • Supervised Learning
      • Notations
      • Probabilistic Modeling
        • Naive Bayes Classifier
      • Linear Regression
      • Nearest Neighbor
      • Evaluating a Classifier
      • Parametric Estimation
        • Bayesian Approach to Parameter Estimation
        • Parametric Estimation for Simple Linear Regression
        • Parametric Estimation for Multivariate Linear Regression
        • Parametric Estimation for Simple Polynomial Regression
        • Parametric Estimation for Multivariate Polynomial Regression
      • Bias and Variance of an Estimator
      • Bias and Variance of a Regression Algorithm
        • Model Selection
      • Logistic Regression
      • Decision Trees
        • Using Decision Trees for Regression
        • Bias and Variance
      • Dimensionality Reduction
      • Neural Networks
        • Training a Neuron
        • MLP
          • Regression with Multiple Outputs
          • Advice/Tricks and Issues to Train a Neural Network
        • Deep Learning
      • Support Vector Machines
      • Ensemble Learning
    • Unsupervised Learning
      • K-Means Clustering
      • Probabilistic Clustering
    • Reinforcement Learning

Linear Regression

As discussed earlier, Regression is a Supervised Learning technique that is used to predict a real value.

Given a dataset $X = \{x^t, r^t\}_{t=1}^N$, our aim is to find the line that minimizes the Mean Squared Error (MSE).

The function that we want to learn can be denoted as:

$f(x|\theta) = w_0 + w_1 x$

We need to find the best values for $w_0, w_1$ (collectively denoted by $\theta$, the parameters) using the available training data.

The Mean Squared Error of f on the data set X is given by:

$Err(f|X) = \frac{1}{N} \sum_{t=1}^N (r^t - f(x^t))^2$

Note that the absolute error would be: $\frac{1}{N} \sum_{t=1}^N |r^t - f(x^t)|$

One advantage of using MSE instead of absolute error is that it penalizes a line more heavily the further it is from the data points. However, this also makes it sensitive to outliers: the line is penalized heavily for being far from outliers, and pulled toward them, even though it arguably shouldn't be. Another reason for preferring MSE is that, while both errors are continuous, MSE is differentiable everywhere, whereas the absolute error is not differentiable where the residual is zero; a smooth objective is easier to minimize by setting its derivative to zero.
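
As a rough illustration of this difference, here is a minimal NumPy sketch (the toy data, the outlier, and the candidate line are made-up values for the example) that computes both errors for the same line on a small dataset containing one outlier:

```python
import numpy as np

# Toy data: points that lie roughly on r = 2x + 1, plus one outlier.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
r = np.array([1.1, 2.9, 5.2, 7.0, 30.0])   # last point is an outlier

# A candidate line f(x) = w0 + w1 * x
w0, w1 = 1.0, 2.0
pred = w0 + w1 * x

mse = np.mean((r - pred) ** 2)   # squared error: outlier contributes (30 - 9)^2 = 441
mae = np.mean(np.abs(r - pred))  # absolute error: outlier contributes only |30 - 9| = 21

print(f"MSE = {mse:.2f}, MAE = {mae:.2f}")
```

The single outlier dominates the squared error far more than the absolute error, which is exactly the sensitivity described above.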

The Role of Noise

When we obtain a data set, we assume that the values were measured with an instrument that is susceptible to noise. It is generally assumed that the noise follows a Gaussian distribution.

Therefore, $r^t$ will not be exactly equal to $f(x^t)$. Instead:

$r^t = f(x^t) + \epsilon^t$

where the noise $\epsilon^t \sim N(0, \sigma^2)$.

We assume that each $\epsilon^t$ is independent.

The noise also determines the conditional distribution of $r^t$ given $x^t$:

$p(r^t|x^t) = N(f(x^t), \sigma^2)$
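
To make this noise model concrete, here is a small NumPy/SciPy sketch (the "true" line $w_0 = 1$, $w_1 = 2$ and the noise level $\sigma = 0.5$ are made-up values for illustration) that generates data according to $r^t = f(x^t) + \epsilon^t$ and evaluates the Gaussian density $p(r^t|x^t)$ at each observed point:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Assumed "true" line and noise level (illustrative values only).
w0_true, w1_true, sigma = 1.0, 2.0, 0.5

N = 10
x = rng.uniform(0, 5, size=N)          # inputs x^t
eps = rng.normal(0, sigma, size=N)     # independent noise eps^t ~ N(0, sigma^2)
r = w0_true + w1_true * x + eps        # observations r^t = f(x^t) + eps^t

# p(r^t | x^t) = N(f(x^t), sigma^2): Gaussian density centered on the line.
densities = norm.pdf(r, loc=w0_true + w1_true * x, scale=sigma)
print(densities)
```

Points that land close to the line get a high density, while points pushed far away by the noise get a low one; the likelihood maximized in the next step is simply the product of these densities.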

Now, we want to find the ML (Maximum Likelihood) value for $\theta$, i.e. the parameters of the ML 'estimator' $f$:

$\theta_{ML} = \text{argmax}_{\theta}\, p(X|\theta) = \text{argmax}_{\theta} \prod_{t=1}^{N} p(r^t|x^t, \theta)$ --- (1)

where the product form follows from the assumption that the noise terms $\epsilon^t$ are independent.

Recall that if $X \sim N(\mu, \sigma^2)$, then the pdf of $X$ is given by:

$p(X) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{\frac{-(X-\mu)^2}{2\sigma^2}}$

Substituting this formula directly into (1) would make things complicated. Instead, we use the log trick: because $\log$ is monotonically increasing, taking the log of a non-negative function does not change its argmax. Taking logs turns products into sums and removes exponents, which makes the maximization much simpler.

Now, (1) becomes:

$\text{argmax}_{\theta}\, \log \prod_{t=1}^N p(r^t|x^t, \theta) = \text{argmax}_{\theta} \sum_{t=1}^N \log p(r^t|x^t, \theta) = \text{argmax}_{\theta} \sum_{t=1}^N \log \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{\frac{-(r^t-(w_1x^t+w_0))^2}{2\sigma^2}}$

$= \text{argmax}_{\theta} \sum_{t=1}^{N} \log \frac{1}{\sqrt{2\pi}\,\sigma} + \sum_{t=1}^N \frac{-(r^t-(w_1x^t+w_0))^2}{2\sigma^2} = \text{argmin}_{\theta} \sum_{t=1}^N (r^t-(w_1x^t+w_0))^2$

The first sum is a constant that does not depend on $\theta$, so it can be ignored. Likewise, the factor $\frac{1}{2\sigma^2}$ in the second sum does not change where the maximum occurs, so it can be dropped; maximizing the remaining negative sum of squares is the same as minimizing the sum of squares.

This proves that, under the assumption of independent additive Gaussian noise, the line that denotes the ML estimator (i.e. the ML hypothesis) is the same line that minimizes the MSE.
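
As a numerical sanity check on this conclusion, the sketch below (NumPy/SciPy on synthetic data; the generating parameters are made up for illustration) fits the line in two ways: by minimizing the MSE with an ordinary least-squares fit (np.polyfit), and by numerically maximizing the Gaussian log-likelihood (via scipy.optimize.minimize on the negative log-likelihood). Under the assumptions above, the two sets of coefficients should agree up to numerical precision:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Synthetic data with independent additive Gaussian noise (illustrative parameters).
w0_true, w1_true, sigma = 1.0, 2.0, 0.5
x = rng.uniform(0, 5, size=50)
r = w0_true + w1_true * x + rng.normal(0, sigma, size=50)

# 1) Least squares: the line that minimizes the MSE.
w1_ls, w0_ls = np.polyfit(x, r, deg=1)

# 2) Maximum likelihood: minimize the negative Gaussian log-likelihood over theta = (w0, w1).
def neg_log_likelihood(theta):
    w0, w1 = theta
    resid = r - (w0 + w1 * x)
    return np.sum(0.5 * np.log(2 * np.pi * sigma**2) + resid**2 / (2 * sigma**2))

w0_ml, w1_ml = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0])).x

print(f"least squares:      w0 = {w0_ls:.4f}, w1 = {w1_ls:.4f}")
print(f"maximum likelihood: w0 = {w0_ml:.4f}, w1 = {w1_ml:.4f}")
```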
