Model Selection

This refers to selecting an appropriate model for the task at hand.

For example, for regression: should we use Linear Regression? Polynomial Regression of degree 2? Polynomial Regression of degree 3? And so on.

Cross-Validation

In very low dimensions, we may be able to visualize/plot the data. We may be tempted to compute the squared errors for Linear Regression and Polynomial Regression and choose the one that performs better. However, this should not be done! This is because the squared error may be low on the training data, but may turn out to be extremely high on the test data.

Instead, we must perform cross-validation, as sketched in the code below:

  • divide the dataset into training and validation sets

  • train each model on the training set

  • compute the error of the resulting hypothesis g on the validation set

  • choose the model that minimizes the error on the validation set
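As a concrete illustration, here is a minimal Python sketch of this procedure, comparing Linear Regression (degree 1) against Polynomial Regression of higher degrees on a held-out validation set. The synthetic data and the candidate degrees are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D regression data (assumed purely for illustration).
x = rng.uniform(-3, 3, size=100)
y = 0.5 * x**2 - x + rng.normal(scale=1.0, size=x.shape)

# Divide the dataset into training and validation sets.
idx = rng.permutation(len(x))
train, val = idx[:70], idx[70:]

def validation_error(degree):
    """Train a polynomial model of the given degree on the training set
    and return its mean squared error on the validation set."""
    coeffs = np.polyfit(x[train], y[train], deg=degree)
    preds = np.polyval(coeffs, x[val])
    return np.mean((preds - y[val]) ** 2)

# Candidate models: Linear Regression (degree 1), Polynomial Regression of degree 2, 3, ...
errors = {d: validation_error(d) for d in [1, 2, 3, 4]}
best = min(errors, key=errors.get)
print(errors, "-> choose degree", best)
```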

Regularization

If the number of input variables is large (i.e. large dimension d), then Linear Regression learns a lot of coefficients. Sometimes, these coefficients take on absurdly large positive or negative values. This is a sign of overfitting.

In such cases, we can drive some coefficients to (or towards) 0, to simplify g.

We must find the hypothesis (i.e. the linear function) that minimizes the regularized error function given by:

$E^1 = \text{error on data} + \lambda \cdot (\text{model complexity})$, where 'error on data' can be the squared error, $\lambda$ is a tunable parameter called the regularization parameter, and the model complexity (for a linear function) can be given by $\sum_{i=1}^d |w_i|$. The value of $\lambda$ can be a default value or can be determined using cross-validation.
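As a sketch of how this is done in practice, scikit-learn's `Lasso` minimizes a squared error plus an L1 penalty of exactly this $\sum_{i=1}^d |w_i|$ form, with its `alpha` parameter playing the role of $\lambda$ (up to the library's scaling conventions). The synthetic data below is an assumption for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso

rng = np.random.default_rng(0)

# Synthetic data with d = 20 inputs, only 3 of which actually matter
# (assumed purely for illustration).
n, d = 50, 20
X = rng.normal(size=(n, d))
true_w = np.zeros(d)
true_w[:3] = [2.0, -1.5, 0.5]
y = X @ true_w + rng.normal(scale=0.1, size=n)

# Plain Linear Regression: typically every coefficient ends up non-zero.
plain = LinearRegression().fit(X, y)

# Regularized error: squared error + lambda * sum_i |w_i|
# (lambda is called `alpha` in scikit-learn's Lasso).
reg = Lasso(alpha=0.1).fit(X, y)

print("non-zero coefficients, unregularized:", np.sum(np.abs(plain.coef_) > 1e-6))
print("non-zero coefficients, regularized:  ", np.sum(np.abs(reg.coef_) > 1e-6))
```

If $\lambda$ is to be determined by cross-validation, scikit-learn's `LassoCV` automates that search over `alpha`.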

It can be shown that the hypothesis that minimizes $E^1$ is the MAP hypothesis, for a suitable prior.
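One way to see this connection (a sketch, assuming a Gaussian noise model for the data and a Laplace prior on the weights, neither of which is stated above):

$$
w_{MAP} = \arg\max_w \,\bigl[\log p(\mathcal{D} \mid w) + \log p(w)\bigr] = \arg\min_w \, \sum_n \bigl(y_n - w^T x_n\bigr)^2 + \lambda \sum_{i=1}^{d} |w_i|
$$

The squared-error term comes from the Gaussian likelihood, the $\lambda \sum_i |w_i|$ term comes from the Laplace prior, and $\lambda$ absorbs the noise variance and the prior's scale parameter.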

Linear Discriminant Analysis

This is mainly used for classification problems.

Think of it as computing a 'score' for an example, where the score is a weighted sum of the attributes.

Say we have 3 classes $C_1, C_2, C_3$.

The score for class $i$ is given by $g_i(x \mid w_i, w_{i0}) = w_{i2}x_2 + w_{i1}x_1 + w_{i0}$ where $x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$. It can be computed using $g_i(x) = w_i^T x + w_{i0}$.

Given $g_1(x), g_2(x), g_3(x)$, we must predict the class for $x$ based on the class that maximizes $g_i(x)$, i.e. $\arg\max_i\, g_i(x)$.

We must learn a $g_i(x)$ that can hopefully make accurate predictions on new examples. This linear discriminant function linearly separates the examples that belong to class $i$ from those that do not (say, examples above the line belong to the class and examples below the line do not).
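Here is a minimal Python sketch of scoring and prediction with linear discriminants; the weight vectors and biases are made-up values, since learning them is not covered by the snippet.

```python
import numpy as np

# One weight vector w_i and bias w_i0 per class (made-up values for illustration).
W = np.array([[ 1.0, -0.5],   # w_1
              [-0.3,  0.8],   # w_2
              [ 0.2,  0.1]])  # w_3
w0 = np.array([0.5, -0.2, 0.0])  # w_10, w_20, w_30

def predict(x):
    """Compute g_i(x) = w_i^T x + w_i0 for every class and
    return the (0-based) index of the class with the highest score."""
    scores = W @ x + w0
    return int(np.argmax(scores))

x = np.array([1.0, 2.0])
print(predict(x))  # index of the predicted class among C_1, C_2, C_3
```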

Consider a problem with two classes $C_1$ (+) and $C_2$ (-).

In a generative approach, we attempt to learn/model distributions p(x|+) and p(x|-).

In a discriminative approach, we don't learn/model p(x|+) and p(x|-). We only attempt to discriminate between + and -.

Let $y \equiv P(C_1 \mid x)$, so that $1 - y = P(C_2 \mid x)$.

Choose class $C_1$ if $y > 0.5$, $\frac{y}{1-y} > 1$, or $\log\left(\frac{y}{1-y}\right) > 0$. Otherwise, choose $C_2$.

The Logit or Log Odds function is given by $\text{logit}(y) = \log\left(\frac{y}{1-y}\right)$ for $0 < y < 1$. Its inverse is the logistic function, also called the sigmoid function, i.e. $\text{sigmoid}(z) = \frac{1}{1+e^{-z}}$.
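A quick numerical sketch in Python showing that the sigmoid undoes the logit, and that the log-odds decision rule above agrees with thresholding $y$ at 0.5:

```python
import numpy as np

def logit(y):
    """Log odds: log(y / (1 - y)), defined for 0 < y < 1."""
    return np.log(y / (1.0 - y))

def sigmoid(z):
    """Logistic (sigmoid) function: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

y = np.array([0.1, 0.5, 0.9])
print(sigmoid(logit(y)))          # recovers [0.1, 0.5, 0.9]
print(logit(0.7) > 0, 0.7 > 0.5)  # both True -> choose C_1
```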
