CS-GY 6923: Machine Learning

Support Vector Machines

SVMs are also called kernel methods.

Given linearly separable data (with real-valued attributes), we must find a linear discriminant function that separates the positives and negatives (i.e. the classes) and is as far as possible from any example in the data.

Therefore, SVMs aim to maximize the distance of the decision boundary to the closest point in the data set. This distance is called the margin of the hyperplane, and is denoted by $\rho$.

The line that achieves maximum margin is called the maximum margin hyperplane.

Computing the Equation for the Maximum Margin Hyperplane

Consider the data below (positive class: +1, negative class: -1)

$$\begin{array}{c|c|c} x_1 & x_2 & r\\ 1 & 1 & -1\\ 2 & 2 & +1\\ 1 & 1/2 & -1\\ 3 & 2 & +1\\ 2 & 7/4 & +1\\ 3 & 1 & +1 \end{array}$$

Let $g(x) = w^T x + w_0$.

The hyperplane equation is g(x) = 0.

The distance from a point $x$ to the hyperplane is $\frac{g(x)}{||w||_2} = \frac{g(x)}{\sqrt{w_1^2 + w_2^2 + \dots + w_d^2}}$ (note that there is no $w_0$ in the denominator).
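
For instance, with a hypothetical hyperplane $g(x) = x_1 + x_2 - 3 = 0$ (not derived from the data above), the point $x = (2, 2)$ lies at distance $\frac{2 + 2 - 3}{\sqrt{1^2 + 1^2}} = \frac{1}{\sqrt{2}} \approx 0.71$ from it.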

We need to determine $w, w_0$ such that $w^T x^t + w_0 > 0$ if $r^t = +1$ and $w^T x^t + w_0 \leq 0$ if $r^t = -1$.

For our data, we have the following linear constraints:

$w_1 + w_2 + w_0 \leq 0$

$2w_1 + 2w_2 + w_0 > 0$

$w_1 + \frac{1}{2}w_2 + w_0 \leq 0$, and so on.
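
For completeness, the remaining three constraints, read off from the last three rows of the table, are:

$3w_1 + 2w_2 + w_0 > 0$

$2w_1 + \frac{7}{4}w_2 + w_0 > 0$

$3w_1 + w_2 + w_0 > 0$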

For the maximum margin hyperplane, the points closest to it satisfy $\frac{|g(x)|}{||w||_2} = \rho$.

Suppose we scale the coefficients of $g$ so that the new function (call it $\tilde{g}(x)$) satisfies $|\tilde{g}(x)| = 1$ for the points closest to the hyperplane. Which quantity should we divide the coefficients by? Since those closest points satisfy $|g(x)| = \rho\,||w||_2$, dividing by $\rho\,||w||_2$ does the job.

Now, replacing $g(x)$ by $\tilde{g}(x)$, the new $g(x) = \pm 1$ for the points closest to the hyperplane. These points are called support vectors. Typically, we have at least 3 support vectors, some of which are on the + side and some on the - side. The intuition behind calling them support vectors is that they hold the hyperplane in place: the hyperplane will move if any of them moves.

So, for the support vectors, $g(x^t) = +1$ if $r^t = +1$ and $g(x^t) = -1$ if $r^t = -1$. For other examples:

$g(x^t) > +1$ if $r^t = +1$

$g(x^t) < -1$ if $r^t = -1$

The support vectors satisfy the canonical form of the maximum margin hyperplane, i.e. $g(x) = \pm 1$. So, $\rho = \frac{1}{||w||_2}$.

Therefore, we must maximize $\rho = \frac{1}{\sqrt{w_1^2 + w_2^2 + \dots + w_d^2}}$ subject to the constraints:

$w^T x^t + w_0 \geq +1$ for all $x^t$ where $r^t = +1$

$w^T x^t + w_0 \leq -1$ for all $x^t$ where $r^t = -1$

The solution gives the weights of the maximum margin hyperplane.

(To simplify, we equivalently minimize $w_1^2 + w_2^2 + \dots + w_d^2$.)
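
As a rough illustration (not how dedicated SVM packages actually solve it), the hard-margin problem for the toy data above can be handed to a general-purpose solver. The sketch below uses scipy's SLSQP and assumes numpy/scipy are available; the variable names are ours.

```python
# Hard-margin SVM for the toy data, posed directly as the optimization above:
# minimize ||w||^2 subject to r^t (w^T x^t + w_0) >= 1 for every example.
import numpy as np
from scipy.optimize import minimize

X = np.array([[1, 1], [2, 2], [1, 0.5], [3, 2], [2, 1.75], [3, 1]], dtype=float)
r = np.array([-1, +1, -1, +1, +1, +1], dtype=float)

def objective(p):
    w = p[:2]                    # p = (w1, w2, w0)
    return w @ w                 # minimizing ||w||^2 maximizes the margin

# SLSQP inequality constraints require fun(p) >= 0,
# so each example contributes r^t (w^T x^t + w_0) - 1 >= 0.
constraints = [
    {"type": "ineq", "fun": (lambda p, x=x, y=y: y * (p[:2] @ x + p[2]) - 1.0)}
    for x, y in zip(X, r)
]

result = minimize(objective, x0=np.array([1.0, 1.0, -3.0]),  # rough initial guess
                  method="SLSQP", constraints=constraints)
w1, w2, w0 = result.x
print("w =", (w1, w2), " w0 =", w0, " margin =", 1 / np.hypot(w1, w2))
```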

This is, however, the theoretical computation. SVM software computes it slightly differently.

SVM software often doesn't directly output the weights. Instead, it outputs a dual representation of the hyperplane $g(x) = 0$, where $g(x) = w^T x + w_0$:

Let I be the indices of the support vectors. The SVM software outputs g(x) as a function of the support vectors:

$g(x) = \left[\sum_{t\in I} v_t\,(x^t)^T x\right] + v_0$

where $x = \begin{bmatrix}x_1\\x_2\end{bmatrix}$, i.e. the example to be classified.

Suppose the support vectors are $\begin{bmatrix}2\\7/4\end{bmatrix}, \begin{bmatrix}3\\1\end{bmatrix}, \begin{bmatrix}1\\1\end{bmatrix}$, say $x^{19}, x^{5}, x^{72}$ respectively. The software will then output a coefficient for each support vector, e.g. $v_{19} = 3, v_5 = 9, v_{72} = -4$ respectively, and (say) $v_0 = 3$.

The software will, therefore, output $g(x)$. The coefficients of $x_1, x_2$ in $g(x)$ give the weights $w_1, w_2$, and the constant term gives $w_0$.
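
As a minimal sketch of that conversion, using the made-up support vectors and coefficients from the example above (numpy assumed):

```python
import numpy as np

# Support vectors x^19, x^5, x^72 and their (made-up) dual coefficients.
support_vectors = np.array([[2, 7/4],
                            [3, 1],
                            [1, 1]])
v = np.array([3, 9, -4])        # v_19, v_5, v_72
v0 = 3

# w_i is the coefficient of x_i in g(x), i.e. sum over t of v_t * x^t_i.
w = v @ support_vectors
w0 = v0
print("w =", w, " w0 =", w0)
```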

So, we have two representations of the maximum margin hyperplane:

The Primal Representation: $g(x) = \sum_{i=1}^d w_i x_i + w_0$

The Dual Representation: $g(x) = \left[\sum_{t\in I} v_t\,(x^t)^T x\right] + v_0$

($v_t = 0$ for all $t$ where $x^t$ is not a support vector.)
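
A sketch of what this looks like in practice, assuming scikit-learn is installed: fit a linear SVM on the toy data and read off both representations (a very large C approximates the hard-margin case).

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 2], [1, 0.5], [3, 2], [2, 1.75], [3, 1]])
r = np.array([-1, +1, -1, +1, +1, +1])

clf = SVC(kernel="linear", C=1e6).fit(X, r)

print("support vectors:", clf.support_vectors_)       # the x^t with v_t != 0
print("dual coefficients:", clf.dual_coef_)           # the v_t
print("primal w, w0:", clf.coef_, clf.intercept_)     # available for the linear kernel

# The primal weights can be recovered from the dual output:
print("recovered w:", clf.dual_coef_ @ clf.support_vectors_)
```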

SVM with Non-Linearly-Separable Data

The above works when the data is linearly-separable. What if the data is not linearly separable?

Say the classes are separated by the circle $x_1^2 + x_2^2 = 1$. We can make the data linearly separable using $z_1 = x_1^2, z_2 = x_2^2$.

$g(x) = -x_1^2 - x_2^2 + 1 \Rightarrow g(z) = -z_1 - z_2 + 1$

Then, the above technique can be used, since the data (in terms of z) is linearly separable.
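
A small numpy sketch of this mapping on synthetic points (labels assigned by the circle $x_1^2 + x_2^2 = 1$, just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
r = np.where(X[:, 0]**2 + X[:, 1]**2 < 1, +1, -1)   # + inside the circle, - outside

Z = X**2                                            # z1 = x1^2, z2 = x2^2
g = -Z[:, 0] - Z[:, 1] + 1                          # g(z) = -z1 - z2 + 1, as above
print(np.all(np.sign(g) == r))                      # True: separable by a line in z-space
```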

However, this cannot always be done.

The idea is to map the data to a new space where it becomes linearly separable. This new space may be very high-dimensional, or even infinite-dimensional (as with the Gaussian kernel below).

Let $z^t = \phi(x^t)$, where $\phi$ is a mapping function.

Then, we have:

$g(\phi(x)) = \left[\sum_{t\in I} v_t\,\phi(x^t)\right] \cdot \phi(x) + v_0 = \left[\sum_{t\in I} v_t\,(\phi(x^t) \cdot \phi(x))\right] + v_0$

Suppose we have an easy way to compute $K(x,y) = \phi(x) \cdot \phi(y)$ for $x, y$ in the original space; $K$ is called a kernel function. Then, we have:

$g(\phi(x)) = \left[\sum_{t\in I} v_t\,K(x^t, x)\right] + v_0 \quad (1)$

The kernel trick is to avoid computing $\phi$ in the new space, by using a kernel function.
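
As a sketch, the dual decision function (1) for an arbitrary kernel $K$ can be written without ever forming $\phi$; the support vectors, coefficients and kernel below are placeholders that would come from a trained SVM.

```python
def decision_value(x, support_vectors, v, v0, K):
    """g(phi(x)) = sum over support vectors of v_t * K(x^t, x), plus v_0."""
    return sum(v_t * K(x_t, x) for v_t, x_t in zip(v, support_vectors)) + v0

def predict(x, support_vectors, v, v0, K):
    # Predict + if g(phi(x)) >= 0, - otherwise.
    return +1 if decision_value(x, support_vectors, v, v0, K) >= 0 else -1
```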

Some commonly-used kernel functions:

  • Linear Kernel: $K(x,y) = x \cdot y$

  • Polynomial Kernel (of degree d):

    For $d = 2$: $x_1, x_2 \rightarrow x_1^2, x_2^2, x_1 x_2, x_1, x_2$

    Consider $K(x,y) = K(x_1, x_2, y_1, y_2)$

    Let $\phi^{new}(x) = \begin{bmatrix}1\\\sqrt{2}x_1\\\sqrt{2}x_2\\\sqrt{2}x_1x_2\\x_1^2\\x_2^2\end{bmatrix}$ (this makes it easier to compute). Then, $K^{new}(x,y) = \phi^{new}(x) \cdot \phi^{new}(y) = (1 + x \cdot y)^2$ (a numeric check of this identity appears after this list).

    In general, if the degree is $d$, the kernel function is $K(x,y) = (1 + x \cdot y)^d$

    The hope is that the data becomes linearly separable by a degree-$d$ polynomial in the attributes.

    Another version, without lower-order terms, is $K(x,y) = (x \cdot y)^d$

    d is a tunable parameter.

  • Gaussian Kernel (Radial Basis Function RBF Kernel):

    $K(x,y) = e^{-\frac{||x-y||^2}{2\sigma^2}}$, where $||x-y||$ is the L2 norm, i.e. basically the distance between $x$ and $y$.

    $\sigma$ is a tunable parameter.

    Using a Gaussian kernel is analogous to computing a distance-weighted sum (similar to KNN). It is the most suitable kernel when the data forms a checker-board pattern; to use a polynomial kernel for the same data, we would need a high degree $d$.

    The Gaussian kernel is the most commonly-used kernel.

    It can also be written as $K(x,y) = e^{-\gamma ||x-y||^2}$, where $\gamma = \frac{1}{2\sigma^2}$. A small $\sigma$ means a narrow Gaussian, while a small $\gamma$ means a flat (wide) Gaussian.
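
The degree-2 identity and the two kernels above can be written out and checked numerically; a minimal sketch (numpy assumed, the gamma value is arbitrary):

```python
import numpy as np

def phi_new(x):
    x1, x2 = x
    return np.array([1, np.sqrt(2)*x1, np.sqrt(2)*x2, np.sqrt(2)*x1*x2, x1**2, x2**2])

def poly_kernel(x, y, d=2):
    return (1 + x @ y) ** d

def rbf_kernel(x, y, gamma=0.5):            # gamma = 1 / (2 sigma^2)
    return np.exp(-gamma * np.sum((x - y) ** 2))

x, y = np.array([2.0, 7/4]), np.array([3.0, 1.0])
print(phi_new(x) @ phi_new(y), poly_kernel(x, y))   # the two values agree
print(rbf_kernel(x, y))
```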

In general, from (1), we predict + if $g(\phi(x)) \geq 0$ and predict - otherwise.

Note: If the number of support vectors returned by the SVM software is not much smaller than the size of the training set, it is a clear sign of overfitting.

Soft-Margin Hyperplanes

The previously discussed hyperplane is a hard-margin hyperplane. A soft-margin hyperplane allows examples to be inside the margin region, as well as on the wrong side of the hyperplane. To accommodate the examples that are on the wrong side, we penalize them in proportion to how far they are from where they should be.

For a hard-margin hyperplane, we minimize $||w||^2$ subject to:

$w^T x^t \geq 1$ if $r^t = +1$, and $w^T x^t \leq -1$ if $r^t = -1$. Equivalently, $r^t (w^T x^t) \geq 1$ for all $x^t \in X$.

For a soft-margin hyperplane, we do the following:

$\min\; ||w||^2 + \lambda \sum_t \varepsilon^t$, subject to:

$r^t (w^T x^t) \geq 1 - \varepsilon^t$, with $\varepsilon^t \geq 0$ for all $t$.

$\lambda$ is the penalty parameter. Note that a large penalty may cause overfitting.
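
In practice, the penalty is usually exposed as a parameter of the SVM library. A sketch with scikit-learn (assuming it is installed), whose C parameter plays the role of $\lambda$ here:

```python
from sklearn.svm import SVC

# Small C: wide margin, more margin violations tolerated (more regularization).
# Large C: few violations tolerated, closer to the hard-margin solution,
# hence more prone to overfitting.
soft = SVC(kernel="rbf", gamma=0.5, C=1.0)
nearly_hard = SVC(kernel="rbf", gamma=0.5, C=1e4)
# Both would then be trained with .fit(X, r) on the training data.
```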