CS-GY 6923: Machine Learning

Model Selection

This refers to selecting an appropriate model for the task at hand.

For example, for regression, should we use Linear Regression? Polynomial Regression of degree 2? Polynomial Regression of degree 3? and so on.

Cross-Validation

In very low dimensions, we may be able to visualize/plot the data. We may be tempted to compute the squared error of Linear Regression and of Polynomial Regression on the training data and choose the one that performs better. However, this should not be done! The squared error may be low on the training data but turn out to be extremely high on new (test) data.

Instead, we must perform cross-validation (a code sketch follows the steps below):

  • divide the dataset into training and validation sets

  • train each model on the training set

  • compute the error of each resulting hypothesis g on the validation set

  • choose the model that minimizes the error on the validation set
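
A minimal sketch of this procedure in Python, assuming scikit-learn and NumPy are available; the synthetic dataset, the candidate polynomial degrees, and all variable names below are illustrative, not part of the course notes:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Illustrative synthetic data: a quadratic trend plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = 0.5 * X[:, 0] ** 2 - X[:, 0] + rng.normal(scale=1.0, size=100)

# Divide the dataset into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

best_degree, best_error = None, np.inf
for degree in (1, 2, 3, 4):
    # Train each candidate model (polynomial of the given degree) on the training set.
    poly = PolynomialFeatures(degree)
    g = LinearRegression().fit(poly.fit_transform(X_train), y_train)
    # Compute the error of the resulting hypothesis g on the validation set.
    error = mean_squared_error(y_val, g.predict(poly.transform(X_val)))
    # Keep the model that minimizes the validation error.
    if error < best_error:
        best_degree, best_error = degree, error

print("selected degree:", best_degree, "validation error:", best_error)
```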

Regularization

If the number of input variables is large (i.e. large dimension d), then Linear Regression learns a large number of coefficients. Sometimes these coefficients take on extreme values (very large positive or very large negative), which is a sign of overfitting.

In such cases, we can set some coefficients to 0, to simplify g.

We must find the hypothesis (i.e. the linear function) that minimizes the regularized error function given by:

$E^1 = \text{error on data} + \lambda \cdot (\text{model complexity})$

where 'error on data' can be the squared error, $\lambda$ is a tunable parameter called the regularization parameter, and the model complexity (for a linear function) can be given by $\sum_{i=1}^d |w_i|$. The value of $\lambda$ can be a default value or can be determined using cross-validation.

It can be shown that the hypothesis that minimizes $E^1$ is the MAP hypothesis, for a suitable prior.
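
A minimal sketch of this idea, assuming scikit-learn's Lasso estimator; its objective is the squared error on the data (scaled by a constant) plus $\alpha \sum_i |w_i|$, which matches the form of $E^1$ above with the regularization parameter $\lambda$ played by `alpha`. The dataset and the candidate $\lambda$ values are illustrative, and $\lambda$ is chosen on a validation set as described under Cross-Validation:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Illustrative data with large dimension d = 20 but only 3 informative coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.5, 0.5]
y = X @ w_true + rng.normal(scale=0.5, size=200)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Pick the regularization parameter lambda (alpha) using a validation set.
best_lam, best_err = None, np.inf
for lam in (0.001, 0.01, 0.1, 1.0):
    model = Lasso(alpha=lam).fit(X_train, y_train)
    err = mean_squared_error(y_val, model.predict(X_val))
    if err < best_err:
        best_lam, best_err = lam, err

# The L1 penalty drives many coefficients exactly to 0, simplifying g.
final = Lasso(alpha=best_lam).fit(X_train, y_train)
print("chosen lambda:", best_lam, "nonzero coefficients:", int(np.sum(final.coef_ != 0)))
```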

Linear Discriminant Analysis

This is mainly used for classification problems.

Think of it as computing a 'score' for an example which is a weighted sum of the attributes.

Say we have 3 classes $C_1, C_2, C_3$.

The score for class $i$ is given by $g_i(x \mid w_i, w_{i0}) = w_{i2}x_2 + w_{i1}x_1 + w_{i0}$ where $x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$. It can be computed as $g_i(x) = w_i^T x + w_{i0}$.

Given $g_1(x), g_2(x), g_3(x)$, we predict the class for $x$ as the one that maximizes $g_i(x)$, i.e. $\operatorname{argmax}_i \, g_i(x)$.

We must learn a $g_i(x)$ that can hopefully make accurate predictions on new examples. This linear discriminant function linearly separates the examples that belong to class $i$ from those that do not (say, examples above the line belong to the class and examples below the line do not).
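
A minimal sketch of prediction with linear discriminant functions, assuming the weights $w_i$ and offsets $w_{i0}$ have already been learned; the numeric values below are made up purely for illustration:

```python
import numpy as np

# One weight vector w_i per class (rows), plus one offset w_i0 per class.
W = np.array([[ 1.0, -0.5],   # w_1
              [-0.3,  0.8],   # w_2
              [ 0.2,  0.2]])  # w_3
w0 = np.array([0.1, -0.2, 0.0])

def predict(x):
    # Compute the scores g_i(x) = w_i^T x + w_i0 for all classes at once ...
    scores = W @ x + w0
    # ... and predict the class that maximizes g_i(x), i.e. argmax_i g_i(x).
    return int(np.argmax(scores)) + 1  # classes are numbered C_1, C_2, C_3

x = np.array([0.4, 1.3])
print("predicted class: C_%d" % predict(x))
```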

Consider a problem with two classes $C_1$ (+) and $C_2$ (−).

In a generative approach, we attempt to learn/model distributions p(x|+) and p(x|-).

In a discriminative approach, we don't learn/model p(x|+) and p(x|-). We only attempt to discriminate between + and -.

Let $y \equiv P(C_1 \mid x)$, so that $1 - y = P(C_2 \mid x)$.

Choose class $C_1$ if $y > 0.5$, equivalently if $\frac{y}{1-y} > 1$, or $\log\left(\frac{y}{1-y}\right) > 0$. Otherwise, choose $C_2$.

The logit or log-odds function is given by $\text{logit}(y) = \log\left(\frac{y}{1-y}\right)$ for $0 < y < 1$. Its inverse is the logistic function, also called the sigmoid function: $\text{sigmoid}(z) = \frac{1}{1 + e^{-z}}$.
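
A minimal sketch of the logit (log-odds) function, its inverse the sigmoid, and the two-class decision rule above; the value of $y$ is made up for illustration:

```python
import numpy as np

def logit(y):
    # Log odds, defined for 0 < y < 1.
    return np.log(y / (1.0 - y))

def sigmoid(z):
    # Logistic function, the inverse of logit.
    return 1.0 / (1.0 + np.exp(-z))

y = 0.73                           # y = P(C_1 | x), illustrative value
z = logit(y)
print(np.isclose(sigmoid(z), y))   # True: sigmoid undoes logit
print("C_1" if z > 0 else "C_2")   # choose C_1 iff log(y / (1 - y)) > 0, i.e. y > 0.5
```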
