CS-GY 6923: Machine Learning

Model Selection

This refers to selecting an appropriate model for the task at hand.

For example, for regression, should we use Linear Regression? Polynomial Regression of degree 2? Polynomial Regression of degree 3? and so on.

Cross-Validation

In very low dimensions, we may be able to visualize/plot the data. We may be tempted to compute the squared error of Linear Regression and of Polynomial Regression on the training data and choose the one that performs better. However, this should not be done! The squared error may be low on the training data but turn out to be extremely high on new (test) data.

Instead, we must perform cross-validation (a code sketch follows the steps below):

  • divide the dataset into training and validation sets

  • train each model on the training set

  • compute the error of each resulting hypothesis g on the validation set

  • choose the model that minimizes the error on the validation set
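
A minimal sketch of this procedure in Python, assuming scikit-learn and NumPy are available; the synthetic dataset, the candidate polynomial degrees, and all variable names below are illustrative, not part of the course notes:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Illustrative synthetic data: a quadratic trend plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = 0.5 * X[:, 0] ** 2 - X[:, 0] + rng.normal(scale=1.0, size=100)

# Divide the dataset into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

best_degree, best_error = None, np.inf
for degree in (1, 2, 3, 4):
    # Train each candidate model (polynomial of the given degree) on the training set.
    poly = PolynomialFeatures(degree)
    g = LinearRegression().fit(poly.fit_transform(X_train), y_train)
    # Compute the error of the resulting hypothesis g on the validation set.
    error = mean_squared_error(y_val, g.predict(poly.transform(X_val)))
    # Keep the model that minimizes the validation error.
    if error < best_error:
        best_degree, best_error = degree, error

print("selected degree:", best_degree, "validation error:", best_error)
```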

Regularization

If the number of input variables is large (i.e. large dimension d), then Linear Regression learns a large number of coefficients. Sometimes these coefficients take on extreme values (very large positive or very large negative), which is a sign of overfitting.

In such cases, we can set some coefficients to 0, to simplify g.

We must find the hypothesis (i.e. the linear function) that minimizes the regularized error function given by:

$E^1 = \text{error on data} + \lambda \cdot (\text{model complexity})$

where 'error on data' can be the squared error, $\lambda$ is a tunable parameter called the regularization parameter, and the model complexity (for a linear function) can be given by $\sum_{i=1}^d |w_i|$. The value of $\lambda$ can be a default value or can be determined using cross-validation.

It can be shown that the hypothesis that minimizes $E^1$ is the MAP hypothesis, for a suitable prior.
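
A minimal sketch of this idea, assuming scikit-learn's Lasso estimator; its objective is the squared error on the data (scaled by a constant) plus $\alpha \sum_i |w_i|$, which matches the form of $E^1$ above with the regularization parameter $\lambda$ played by `alpha`. The dataset and the candidate $\lambda$ values are illustrative, and $\lambda$ is chosen on a validation set as described under Cross-Validation:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Illustrative data with large dimension d = 20 but only 3 informative coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.5, 0.5]
y = X @ w_true + rng.normal(scale=0.5, size=200)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Pick the regularization parameter lambda (alpha) using a validation set.
best_lam, best_err = None, np.inf
for lam in (0.001, 0.01, 0.1, 1.0):
    model = Lasso(alpha=lam).fit(X_train, y_train)
    err = mean_squared_error(y_val, model.predict(X_val))
    if err < best_err:
        best_lam, best_err = lam, err

# The L1 penalty drives many coefficients exactly to 0, simplifying g.
final = Lasso(alpha=best_lam).fit(X_train, y_train)
print("chosen lambda:", best_lam, "nonzero coefficients:", int(np.sum(final.coef_ != 0)))
```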

Linear Discriminant Analysis

This is mainly used for classification problems.

Think of it as computing a 'score' for an example which is a weighted sum of the attributes.

Say we have 3 classes $C_1, C_2, C_3$.

The score for class $i$ is given by $g_i(x \mid w_i, w_{i0}) = w_{i2}x_2 + w_{i1}x_1 + w_{i0}$ where $x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$. It can be computed as $g_i(x) = w_i^T x + w_{i0}$.

Given $g_1(x), g_2(x), g_3(x)$, we predict the class for $x$ as the one that maximizes $g_i(x)$, i.e. $\operatorname{argmax}_i \, g_i(x)$.

We must learn a $g_i(x)$ that can hopefully make accurate predictions on new examples. This linear discriminant function linearly separates the examples that belong to class $i$ from those that do not (say, examples above the line belong to the class and examples below the line do not).
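
A minimal sketch of prediction with linear discriminant functions, assuming the weights $w_i$ and offsets $w_{i0}$ have already been learned; the numeric values below are made up purely for illustration:

```python
import numpy as np

# One weight vector w_i per class (rows), plus one offset w_i0 per class.
W = np.array([[ 1.0, -0.5],   # w_1
              [-0.3,  0.8],   # w_2
              [ 0.2,  0.2]])  # w_3
w0 = np.array([0.1, -0.2, 0.0])

def predict(x):
    # Compute the scores g_i(x) = w_i^T x + w_i0 for all classes at once ...
    scores = W @ x + w0
    # ... and predict the class that maximizes g_i(x), i.e. argmax_i g_i(x).
    return int(np.argmax(scores)) + 1  # classes are numbered C_1, C_2, C_3

x = np.array([0.4, 1.3])
print("predicted class: C_%d" % predict(x))
```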

Consider a problem with two classes $C_1$ (+) and $C_2$ (−).

In a generative approach, we attempt to learn/model distributions p(x|+) and p(x|-).

In a discriminative approach, we don't learn/model p(x|+) and p(x|-). We only attempt to discriminate between + and -.

Let $y \equiv P(C_1 \mid x)$, so that $1 - y = P(C_2 \mid x)$.

Choose class $C_1$ if $y > 0.5$, equivalently if $\frac{y}{1-y} > 1$, or $\log\left(\frac{y}{1-y}\right) > 0$. Otherwise, choose $C_2$.

The logit or log-odds function is given by $\text{logit}(y) = \log\left(\frac{y}{1-y}\right)$ for $0 < y < 1$. Its inverse is the logistic function, also called the sigmoid function: $\text{sigmoid}(z) = \frac{1}{1 + e^{-z}}$.
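
A minimal sketch of the logit (log-odds) function, its inverse the sigmoid, and the two-class decision rule above; the value of $y$ is made up for illustration:

```python
import numpy as np

def logit(y):
    # Log odds, defined for 0 < y < 1.
    return np.log(y / (1.0 - y))

def sigmoid(z):
    # Logistic function, the inverse of logit.
    return 1.0 / (1.0 + np.exp(-z))

y = 0.73                           # y = P(C_1 | x), illustrative value
z = logit(y)
print(np.isclose(sigmoid(z), y))   # True: sigmoid undoes logit
print("C_1" if z > 0 else "C_2")   # choose C_1 iff log(y / (1 - y)) > 0, i.e. y > 0.5
```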
