Support Vector Machines

SVMs are also called kernel methods.

Given linearly separable data (with real-valued attributes), we must find a linear discriminant function that separates the positives and negatives (i.e. the classes) and is as far as possible from any example in the data.

Therefore, SVMs aim to maximize the distance of the decision boundary to the closest point in the data set. This distance is called the margin of the hyperplane, and is denoted by $\rho$ .

The line that achieves maximum margin is called the maximum margin hyperplane.

Computing the Equation for the Maximum Margin Hyperplane

Consider the data below (positive class: +1, negative class: -1)

The hyperplane equation is g(x) = 0.

For our data, we have the following linear constraints:

The solution gives the weights of the maximum margin hyperplane.

This is, however, the theoretical computation. SVM softwares compute slightly differently.

Let I be the indices of the support vectors. The SVM software outputs g(x) as a function of the support vectors:

So, we have two representtions of the maximum margin hyperplane:

SVM with Non-Linearly-Separable Data

The above works when the data is linearly-separable. What if the data is not linearly separable?

Then, the above technique can be used, since the data (in terms of z) is linearly separable.

However, this cannot always be done.

The idea is to map the data to a new space where it becomes linearly separable. This new space is an infinite-dimensional space.

Then, we have:

Some commonly-used kernel functions:

Polynomial Kernel (of degree d):
The hope is that the data becomes linearly separable by a d-degree polynomial in the attributes.
d is a tunable parameter.
Gaussian Kernel (Radial Basis Function RBF Kernel):
Using a Gaussian kernel is analogous to computing a distance-weighted sum (similar to KNN). It is the most suitable kernel when the data forms a checker-board pattern. To use a Polynomial kernel for the same data, we would need to use high degree d.
The Gaussian kernel is the most commonly-used kernel.

Note: If the number of support vectors returned by the SVM software is not much smaller than the size of the training set, it is a clear sign of overfitting.

Soft-Margin Hyperplanes

The previously discussed hyperplane refers to a hard-margin hyperplane. A soft-margin hyperplane allows examples to be in the margin region, as well as on the wrong side of the hyperplane. To accomodate for the examples that are on the wrong side of the hyperplane, we penalize by the distance of those examples.

For a soft-margin hyperplane, we do the following:

PreviousDeep Learning NextEnsemble Learning

Last updated 4 years ago