Linear Regression
As discussed earlier, Regression is a Supervised Learning technique that is used to predict a real value.
Given a dataset $\mathcal{X} = \{x^t, r^t\}_{t=1}^{N}$, our aim is to find the line that minimizes the Mean Squared Error.
The function that we want to learn can be denoted as:
$$f(x \mid \theta) = w_0 + w_1 x$$
We need to find the best values for $w_0$ and $w_1$ (collectively denoted by $\theta$, the parameters) using the available training data.
The Mean Squared Error of $f$ on the data set $\mathcal{X}$ is given by:
$$\mathrm{Err}(f \mid \mathcal{X}) = \frac{1}{N} \sum_{t=1}^{N} \left( r^t - f(x^t) \right)^2$$
Note that the absolute error would be: $\frac{1}{N} \sum_{t=1}^{N} \left| r^t - f(x^t) \right|$
One advantage of using MSE instead of absolute error is that it penalizes a line more heavily for being further away from the data points. However, this also makes it sensitive to outliers, i.e. a line is penalized heavily for being far away from outliers, even though it arguably shouldn't be. Another reason for preferring MSE over absolute error is that MSE is differentiable everywhere, while the absolute error is not (it has a kink wherever a residual is zero). It is easier to minimize a differentiable function by setting its derivative to zero.
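As a quick illustration, here is a minimal Python sketch that evaluates both error measures for a candidate line; the toy data points and the candidate parameters $w_0 = 1$, $w_1 = 2$ are made up for the example and do not come from the notes.

```python
import numpy as np

# Hypothetical toy data set {(x^t, r^t)} -- these values are made up for illustration.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
r = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

def f(x, w0, w1):
    """The linear model f(x | theta) = w0 + w1 * x."""
    return w0 + w1 * x

def mean_squared_error(r, pred):
    """(1/N) * sum over t of (r^t - f(x^t))^2."""
    return np.mean((r - pred) ** 2)

def mean_absolute_error(r, pred):
    """(1/N) * sum over t of |r^t - f(x^t)|."""
    return np.mean(np.abs(r - pred))

# Evaluate an arbitrary candidate line, here w0 = 1, w1 = 2.
pred = f(x, w0=1.0, w1=2.0)
print("MSE:           ", mean_squared_error(r, pred))
print("Absolute error:", mean_absolute_error(r, pred))
```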
The Role of Noise
When we obtain a data set, we assume that the values were measured with an instrument that is susceptible to noise. It is generally assumed that the noise follows a Gaussian distribution.
Therefore, $r^t$ will not be exactly equal to $f(x^t)$. Instead:
$$r^t = f(x^t) + \epsilon^t$$
where the noise $\epsilon^t \sim \mathcal{N}(0, \sigma^2)$.
We assume that each $\epsilon^t$ is independent.
The noise also determines the conditional distribution of $r^t$ given $x^t$:
$$p(r^t \mid x^t) = \mathcal{N}\left(f(x^t), \sigma^2\right)$$
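To make the noise model concrete, the following sketch simulates data from it; the true line ($w_0 = 1$, $w_1 = 2$) and noise level ($\sigma = 0.5$) are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed "true" parameters for the simulation (not from the notes).
w0_true, w1_true, sigma = 1.0, 2.0, 0.5
N = 100

x = rng.uniform(0, 5, size=N)
epsilon = rng.normal(loc=0.0, scale=sigma, size=N)  # epsilon^t ~ N(0, sigma^2), drawn independently
r = w0_true + w1_true * x + epsilon                 # r^t = f(x^t) + epsilon^t

# Each r^t is therefore a draw from p(r^t | x^t) = N(f(x^t), sigma^2).
```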
Now, we want to find the ML (Maximum Likelihood) value of $\theta$, i.e. the ML estimator of $f$:
$$\theta_{ML} = \arg\max_\theta p(\mathcal{X} \mid \theta) = \arg\max_\theta \prod_{t=1}^{N} p(r^t \mid x^t, \theta) \quad \text{--- (1)}$$
Recall that if $X \sim \mathcal{N}(\mu, \sigma^2)$, then the pdf of the distribution of $X$ is given by:
$$p(X) = \frac{1}{\sqrt{2\pi}\sigma} \, e^{-\frac{(X - \mu)^2}{2\sigma^2}}$$
If we plug this formula into (1) directly, things get complicated. Instead, we use the log trick: since log is a monotonically increasing function, taking the log of a non-negative valued function does not change its argmax. Taking the log turns products into sums and removes the exponentials, making the expression much easier to work with.
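A quick numerical sanity check of that claim, on an arbitrary positive-valued function evaluated over a grid of candidate $\theta$ values:

```python
import numpy as np

theta = np.linspace(-3, 3, 601)
g = np.exp(-(theta - 1.2) ** 2)  # an arbitrary positive function, peaked at theta = 1.2

# log is monotonically increasing, so the maximizer does not move.
assert np.argmax(g) == np.argmax(np.log(g))
print(theta[np.argmax(g)], theta[np.argmax(np.log(g))])  # both print ~1.2
```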
Now, (1) becomes:
$$\arg\max_\theta \log \prod_{t=1}^{N} p(r^t \mid x^t, \theta) = \arg\max_\theta \sum_{t=1}^{N} \log p(r^t \mid x^t, \theta) = \arg\max_\theta \sum_{t=1}^{N} \log \left[ \frac{1}{\sqrt{2\pi}\sigma} \, e^{-\frac{\left(r^t - (w_1 x^t + w_0)\right)^2}{2\sigma^2}} \right]$$
$$= \arg\max_\theta \left[ \sum_{t=1}^{N} \log \frac{1}{\sqrt{2\pi}\sigma} + \sum_{t=1}^{N} \frac{-\left(r^t - (w_1 x^t + w_0)\right)^2}{2\sigma^2} \right] = \arg\min_\theta \sum_{t=1}^{N} \left( r^t - (w_1 x^t + w_0) \right)^2$$
The first sum is a constant with respect to $\theta$ and can be ignored, as can the constant denominator $2\sigma^2$; the remaining minus sign turns the argmax into an argmin.
This proves that, under the assumption of independent additive Gaussian noise, the line given by the ML estimator (i.e. the ML hypothesis) is the same line that minimizes the MSE.
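As a numerical check of this equivalence, the sketch below (assuming NumPy and SciPy are available, and re-using hypothetical true parameters like the simulation above) maximizes the Gaussian log-likelihood directly and compares the result with the least-squares (minimum-MSE) fit; the two estimates should agree up to optimizer tolerance.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)

# Synthetic data under the additive Gaussian noise model (assumed true parameters).
w0_true, w1_true, sigma = 1.0, 2.0, 0.5
x = rng.uniform(0, 5, size=200)
r = w0_true + w1_true * x + rng.normal(0.0, sigma, size=200)

def neg_log_likelihood(theta):
    """Negative log-likelihood of the data under theta = (w0, w1), with sigma known."""
    w0, w1 = theta
    return -np.sum(norm.logpdf(r, loc=w0 + w1 * x, scale=sigma))

# Maximum-likelihood fit: minimize the negative log-likelihood.
w0_ml, w1_ml = minimize(neg_log_likelihood, x0=[0.0, 0.0]).x

# Minimum-MSE (least-squares) fit; np.polyfit returns [w1, w0] for degree 1.
w1_ls, w0_ls = np.polyfit(x, r, deg=1)

print(f"ML fit:            w0 = {w0_ml:.4f}, w1 = {w1_ml:.4f}")
print(f"Least-squares fit: w0 = {w0_ls:.4f}, w1 = {w1_ls:.4f}")
```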