Linear Regression

As discussed earlier, Regression is a Supervised Learning technique that is used to predict a real value.

Given a dataset $X = \{x^t, r^t\}_{t=1}^N$, our aim is to find the line that minimizes the Mean Squared Error.

The function that we want to learn can be denoted as:

$$f(x|\theta) = w_0 + w_1 x$$

We need to find the best values for $w_0, w_1$ (collectively denoted by $\theta$, the parameters) using the available training data.
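
A minimal sketch of this hypothesis in code (assuming Python with NumPy; the notes themselves don't specify a language):

```python
import numpy as np

def f(x, w0, w1):
    """Linear hypothesis f(x | theta) = w0 + w1 * x, with theta = (w0, w1)."""
    return w0 + w1 * np.asarray(x)
```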

The Mean Squared Error of $f$ on the dataset $X$ is given by:

$$Err(f|X) = \frac{1}{N} \sum_{t=1}^N \left(r^t - f(x^t)\right)^2$$

Note that the absolute error would be: $\frac{1}{N} \sum_{t=1}^N |r^t - f(x^t)|$

One advantage of using MSE instead of absolute error is that it penalizes a line more heavily the further it is from the data points. However, this also makes it sensitive to outliers, i.e., a line is penalized heavily for being far from outliers, even though it arguably shouldn't be. Another reason for preferring MSE is that it is differentiable everywhere, while the absolute error is not differentiable at zero; a differentiable error is easier to minimize by taking the derivative and setting it to zero.
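
To make the comparison concrete, here is a small sketch (Python with NumPy; the data values and candidate line are made up for illustration, not taken from the notes) that computes both errors for a fixed line, with and without an outlier:

```python
import numpy as np

def mse(r, pred):
    """Mean Squared Error: (1/N) * sum of (r^t - f(x^t))^2."""
    return np.mean((r - pred) ** 2)

def mae(r, pred):
    """Mean absolute error: (1/N) * sum of |r^t - f(x^t)|."""
    return np.mean(np.abs(r - pred))

# Candidate line f(x) = w0 + w1 * x (illustrative values, not fitted).
w0, w1 = 1.0, 2.0

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
r = np.array([1.1, 2.9, 5.2, 6.8, 9.1])   # points roughly on the line
r_outlier = r.copy()
r_outlier[-1] = 30.0                       # one point far from the line

pred = w0 + w1 * x
for label, targets in [("clean", r), ("with outlier", r_outlier)]:
    print(f"{label:12s}  MSE = {mse(targets, pred):7.2f}   MAE = {mae(targets, pred):5.2f}")
```

A single far-away point inflates the MSE far more than the absolute error, which is exactly the outlier sensitivity described above.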

The Role of Noise

When we obtain a dataset, we assume that the values were measured with an instrument that is susceptible to noise. It is generally assumed that the noise follows a Gaussian distribution.

Therefore, $r^t$ will not be exactly equal to $f(x^t)$. Instead:

$$r^t = f(x^t) + \epsilon^t$$

where the noise $\epsilon^t \sim N(0, \sigma^2)$.

We assume that each $\epsilon^t$ is independent.
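
As a quick sketch of this noise model (Python with NumPy; the true parameters $w_0$, $w_1$, $\sigma$ and the input range are illustrative choices), each observed $r^t$ is the true line plus an independent Gaussian draw:

```python
import numpy as np

rng = np.random.default_rng(0)

# True (unknown) parameters -- illustrative values only.
w0_true, w1_true, sigma = 1.0, 2.0, 0.5
N = 50

x = rng.uniform(0.0, 5.0, size=N)        # inputs x^t
eps = rng.normal(0.0, sigma, size=N)     # independent noise eps^t ~ N(0, sigma^2)
r = w0_true + w1_true * x + eps          # observations r^t = f(x^t) + eps^t
```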

The noise also determines the conditional distribution of $r^t$ given $x^t$:

$$p(r^t|x^t) = N(f(x^t), \sigma^2)$$

Now, we want to find the ML (Maximum Likelihood) value of $\theta$, i.e., the ML estimate of $f$:

$$\theta_{ML} = \arg\max_{\theta} p(X|\theta) = \arg\max_{\theta} \prod_{t=1}^{N} p(r^t|x^t, \theta) \qquad (1)$$

Recall that if $X \sim N(\mu, \sigma^2)$, then the pdf of $X$ is given by:

$$p(X) = \frac{1}{\sqrt{2\pi}\sigma}\, e^{-\frac{(X-\mu)^2}{2\sigma^2}}$$
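
A direct translation of this pdf into code (a sketch in Python with NumPy):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """pdf of N(mu, sigma^2) evaluated at x."""
    return np.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / (np.sqrt(2.0 * np.pi) * sigma)
```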

If we substitute this formula directly into (1), things get complicated. Instead, we use the log trick: since $\log$ is a monotonically increasing function, applying it to a positive-valued function does not change the argmax. Taking the log turns products into sums and removes the exponentials, which simplifies the algebra.
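
A quick numeric check of the identity behind the trick (a sketch in Python with NumPy; the per-example likelihood values are made up for illustration):

```python
import numpy as np

# Some per-example likelihood values p(r^t | x^t, theta) -- made up for illustration.
p = np.array([0.31, 0.12, 0.45, 0.08, 0.27])

log_of_product = np.log(np.prod(p))   # log of the product of likelihoods
sum_of_logs = np.sum(np.log(p))       # sum of the log-likelihoods

print(log_of_product, sum_of_logs)    # identical (up to floating-point rounding)
```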

Now, (1) becomes:

$$\arg\max_{\theta} \log \prod_{t=1}^{N} p(r^t|x^t, \theta) = \arg\max_{\theta} \sum_{t=1}^{N} \log p(r^t|x^t, \theta) = \arg\max_{\theta} \sum_{t=1}^{N} \log \frac{1}{\sqrt{2\pi}\sigma}\, e^{-\frac{(r^t-(w_1x^t+w_0))^2}{2\sigma^2}}$$

$$= \arg\max_{\theta} \left[ \underbrace{\sum_{t=1}^{N} \log \frac{1}{\sqrt{2\pi}\sigma}}_{\text{constant, can be ignored}} - \sum_{t=1}^{N} \frac{(r^t-(w_1x^t+w_0))^2}{2\sigma^2} \right] = \arg\min_{\theta} \sum_{t=1}^{N} \left(r^t-(w_1x^t+w_0)\right)^2$$

(the first term is a constant and the denominator $2\sigma^2$ is a positive constant, so neither affects the maximizer; maximizing the negated sum of squares is the same as minimizing it).

This proves that, under the assumption of independent additive Gaussian noise, the line given by the ML estimate (i.e., the ML hypothesis) is the same line that minimizes the MSE.
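
The equivalence can be sanity-checked numerically. The sketch below (Python with NumPy; the synthetic data and the coarse grid over $w_0, w_1$ are illustrative choices, not part of the notes) evaluates both criteria over the same grid of candidate lines and confirms they pick the same one:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: r^t = w0 + w1 * x^t + Gaussian noise (illustrative parameters).
w0_true, w1_true, sigma = 1.0, 2.0, 0.5
x = rng.uniform(0.0, 5.0, size=100)
r = w0_true + w1_true * x + rng.normal(0.0, sigma, size=100)

# Coarse grid over candidate parameters theta = (w0, w1).
w0_grid = np.linspace(0.0, 2.0, 81)
w1_grid = np.linspace(1.0, 3.0, 81)

best_sse, best_ll = None, None
argmin_sse, argmax_ll = None, None
for w0 in w0_grid:
    for w1 in w1_grid:
        resid = r - (w0 + w1 * x)
        sse = np.sum(resid ** 2)                            # sum of squared errors
        ll = np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2)   # Gaussian log-likelihood
                    - resid ** 2 / (2 * sigma ** 2))
        if best_sse is None or sse < best_sse:
            best_sse, argmin_sse = sse, (w0, w1)
        if best_ll is None or ll > best_ll:
            best_ll, argmax_ll = ll, (w0, w1)

print("argmin SSE :", argmin_sse)
print("argmax LL  :", argmax_ll)   # same (w0, w1) as the line above
```

Because the log-likelihood is a constant minus the sum of squared errors divided by $2\sigma^2$, the two searches always land on the same candidate line.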
