Evaluating a Classifier

Train-Test Split

Given a dataset, we first split it into a training set (say 70%) and a test set (say 30%). We must make sure to randomly shuffle the examples before splitting, to avoid situations such as all the examples of one class ending up in the same set.

We train using the training set and compute the error (and accuracy) on the test set.

Note: The main aim is to generalize to unseen examples. If we train using the entire dataset, the classifier will most probably overfit, i.e. it will classify the training examples fairly well but will not be able to generalize to unseen test examples.

A good practice is to compute both the training error and the test error.

If training error << test error, the model is probably overfitting!
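
A minimal sketch of the above, assuming scikit-learn is available; the iris data and the kNN classifier are just stand-ins for your own dataset and model:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# shuffle=True (the default) randomly reorders the examples before splitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=0)

clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

train_acc = clf.score(X_train, y_train)   # accuracy on the training set
test_acc = clf.score(X_test, y_test)      # accuracy on unseen test examples
print(f"train accuracy = {train_acc:.3f}, test accuracy = {test_acc:.3f}")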

k-fold Cross-Validation

This is mainly useful when we have a small dataset: a single train/test split would leave too little data for both training and testing.

Instead of splitting the dataset into a train and a test set, do the following:

divide the dataset into k equal subsets S_1, ..., S_k.

for i = 1 to k:
    train on examples in all subsets except S_i
    get hypothesis h_i
    use h_i to predict labels of examples in S_i
    compute the error for h_i

compute the average error of all the h_i's

(k=5 is common, though k=10, 12 are also used)

This way, every example in the dataset is eventually used for both training and testing.
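
A rough Python translation of the loop above, again assuming scikit-learn; the dataset and the classifier are stand-ins:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)   # k = 5 subsets

errors = []
for train_idx, test_idx in kf.split(X):
    # train h_i on all subsets except S_i
    h_i = KNeighborsClassifier(n_neighbors=3).fit(X[train_idx], y[train_idx])
    # use h_i to predict the labels of the held-out subset S_i and record its error
    errors.append(1 - h_i.score(X[test_idx], y[test_idx]))

print(f"average cross-validation error = {np.mean(errors):.3f}")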

Note: A version of cross-validation can also be used to choose the values of the hyperparameters of a learning algorithm. For example, choosing k for kNN (a sketch follows the list below):

- establish a range of possible values for k: say 1, 3, 5, 7, 9
- set aside some examples from the training set as a validation set
- for each k, train on the examples not in the validation set
- test on the examples in the validation set
- choose the k with the lowest error on the validation set
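
A minimal sketch of this hold-out procedure, assuming scikit-learn; the dataset is a stand-in:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# set aside 20% of the training data as a validation set
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=0)

best_k, best_acc = None, 0.0
for k in [1, 3, 5, 7, 9]:
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = clf.score(X_val, y_val)          # accuracy on the validation set
    if acc > best_acc:
        best_k, best_acc = k, acc

print(f"chosen k = {best_k} (validation accuracy = {best_acc:.3f})")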

Leave-one-out cross-validation (LOOCV) uses k = N, where N is the number of examples; it is also called N-fold CV. It is effective, but time-consuming.

ROC Curve

A confusion matrix gives the counts of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). From these we can compute the true positive rate, TPR = TP / (TP + FN), and the false positive rate, FPR = FP / (FP + TN).
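
A small sketch of reading these counts off a confusion matrix, assuming scikit-learn; the labels below are made up:

from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 0, 1, 1, 0, 1]

# for binary labels, confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

tpr = tp / (tp + fn)   # true positive rate
fpr = fp / (fp + tn)   # false positive rate
print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}")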

An ROC curve plots the TPR vs. the FPR at different thresholds.

The dotted diagonal line (TPR = FPR) corresponds to random guessing for a binary classifier, i.e. about 50% accuracy when the classes are balanced.

The AUC (Area Under the Curve) summarizes the classifier's performance across all thresholds, so the higher the curve sits above the dotted line, the better the classifier.

At TPR=FPR=0, all examples were labeled negative. At FPR=TPR=1, all examples were labeled positive.

At FPR=0, TPR=1, all examples were classified correctly! (ROC Heaven!)

At FPR=1, TPR=0, all examples were misclassified! (ROC Hell!)

At ROC Heaven, AUC=1 (max. possible AUC). At ROC Hell, AUC=0 (min. possible AUC).
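
A minimal sketch of computing the ROC points and the AUC, assuming scikit-learn; the synthetic data and logistic regression model are stand-ins for your own classifier:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]    # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores)   # one (FPR, TPR) point per threshold
auc = roc_auc_score(y_test, scores)
print(f"AUC = {auc:.3f}")                   # 1.0 is ROC Heaven, 0.0 is ROC Hell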
