CS-GY 6923: Machine Learning

Decision Trees

This classifier is commonly used when the data is not linearly separable.

Decision Trees can be used with only categorical attributes, only numerical attributes, or a mix of both.

Consider the dataset below:

$$\begin{array}{c|c|c} x_1 & x_2 & \text{label}\\ 2.3 & 4.7 & +\\ 1.9 & 2.8 & +\\ 6.1 & 4.3 & - \end{array}$$

Each node of the Decision Tree for the above dataset will be of the form $x_i > t$ (for $t \in \mathbb{R}$), and will have a binary split (N on the left, Y on the right). The leaf nodes will be labels.
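To make this concrete, here is a minimal Python sketch of such a tree. The `Node` class and `predict` helper are illustrative names (not code from the course); the example tree splits on $x_1 > 4.0$, which happens to separate the three examples above.

```python
# Minimal sketch of a binary decision-tree node of the form x_i > t.
# Names here are illustrative, not part of the course material.

class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None, label=None):
        self.feature = feature      # index i of the attribute tested at this node
        self.threshold = threshold  # threshold t in the test x_i > t
        self.left = left            # "N" branch (x_i <= t)
        self.right = right          # "Y" branch (x_i > t)
        self.label = label          # class label if this node is a leaf

def predict(node, x):
    """Route example x down the tree until a leaf is reached."""
    while node.label is None:
        node = node.right if x[node.feature] > node.threshold else node.left
    return node.label

# Example tree: test x_1 > 4.0 at the root; the leaves carry the labels.
tree = Node(feature=0, threshold=4.0,
            left=Node(label="+"),    # x_1 <= 4.0  ->  +
            right=Node(label="-"))   # x_1 >  4.0  ->  -

print(predict(tree, [2.3, 4.7]))  # +
print(predict(tree, [6.1, 4.3]))  # -
```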

The goal is to generate a small tree whose leaves are pure (all examples reaching the leaf have the same label) or mostly pure (the majority of examples reaching the leaf have the same label).

Larger trees are more prone to overfitting.

Finding the smallest tree, whether in terms of depth, number of nodes, or both, is NP-hard. So, we use a greedy heuristic, which means there is no guarantee that we generate the best tree.

Some advantages of using Decision Trees:

  1. Easy to generate trees

  2. Human-interpretable, i.e., a human can visualize the tree and understand the decision being made at each node in order to verify whether something is wrong

How to determine the root node?

To determine which attribute to use at the root node, we compute the impurity of the split induced by each candidate attribute and choose the one that yields the purest partition.

Consider labels + and -. Highest purity is when $P_+ = 1$ or $P_+ = 0$, and highest impurity is when $P_+ = 0.5$,

where $P_+$ is the fraction of examples labeled +.

We use Entropy to determine purity.

$$\text{Entropy} = -P_+ \log_2(P_+) - P_- \log_2(P_-)$$
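The short Python helper below (a hypothetical sketch that simply follows this formula) shows how entropy behaves at the extremes:

```python
import math

def entropy(p_plus):
    """Binary entropy of a node, where p_plus is the fraction of + examples."""
    total = 0.0
    for p in (p_plus, 1.0 - p_plus):
        if p > 0:                     # convention: 0 * log2(0) = 0
            total -= p * math.log2(p)
    return total

print(entropy(1.0))    # 0.0    (pure node: highest purity)
print(entropy(0.5))    # 1.0    (highest impurity)
print(entropy(2 / 3))  # ~0.918 (e.g. 2 of 3 examples are +)
```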

If we pick a random example from $X$, what is the expected entropy of the node where it ends up after the split?

It can be computed as $P(\text{split decision}) \cdot \text{Entropy}(\text{split decision}) + P(\sim \text{split decision}) \cdot \text{Entropy}(\sim \text{split decision})$

For each possible decision leading to a different partition of $X$ into $X_Y$ and $X_N$, choose the decision that minimizes:

$$P[Y] \cdot \text{Entropy}(X_Y) + P[N] \cdot \text{Entropy}(X_N)$$

where $P[Y] = \frac{|X_Y|}{|X|}$ and $P[N] = \frac{|X_N|}{|X|}$,

and $X_Y = \{x \in X \mid \text{decision on } x \text{ is } Y\}$, $X_N = \{x \in X \mid \text{decision on } x \text{ is } N\}$
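The brute-force sketch below (the names `best_split` and `label_entropy` are made up for illustration) tries every threshold suggested by the data and returns the decision minimizing this weighted entropy; on the three-example dataset above it recovers a pure split on $x_1$.

```python
import math

def label_entropy(labels):
    """Entropy of a list of labels (any number of classes)."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return sum(-p * math.log2(p) for p in probs if p > 0)

def best_split(X, y):
    """Return the split x_i > t minimizing P[Y]*Entropy(X_Y) + P[N]*Entropy(X_N)."""
    n, d = len(X), len(X[0])
    best = None
    for i in range(d):                            # every attribute
        for t in sorted({row[i] for row in X}):   # every threshold seen in the data
            y_yes = [lbl for row, lbl in zip(X, y) if row[i] > t]
            y_no  = [lbl for row, lbl in zip(X, y) if row[i] <= t]
            if not y_yes or not y_no:
                continue                          # skip degenerate splits
            expected = (len(y_yes) / n) * label_entropy(y_yes) \
                     + (len(y_no) / n) * label_entropy(y_no)
            if best is None or expected < best[0]:
                best = (expected, i, t)
    return best  # (expected entropy, attribute index, threshold)

X = [[2.3, 4.7], [1.9, 2.8], [6.1, 4.3]]
y = ["+", "+", "-"]
print(best_split(X, y))  # (0.0, 0, 2.3): splitting on x_1 > 2.3 gives two pure leaves
```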

Note: If we have categorical attributes, instead of Y and N the node will have one branch per value of that attribute.

Preventing Overfitting

There are a few methods to prevent overfitting in Decision Trees:

  1. Stop Growing the Tree (Pre-Pruning): stop growing the tree before reaching purity in each leaf. Form a leaf if either:

    • the number of training examples reaching the node is less than $t_1$ (a threshold we choose), or

    • the entropy of the set of training examples reaching the node is less than $t_2$ (a threshold we choose)

  2. Post-Pruning: grow the tree to purity, then cut off subtrees and replace them with leaves. This technique needs a validation set to determine where and how much to prune.
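As one concrete way to apply these ideas, the sketch below uses scikit-learn's `DecisionTreeClassifier` (assuming scikit-learn is available). The parameter values are arbitrary examples: `min_samples_leaf` plays the role of $t_1$, `min_impurity_decrease` is only the closest available analogue of the $t_2$ rule (not an exact match), and `ccp_alpha` controls cost-complexity post-pruning, whose strength would normally be chosen on a validation set.

```python
# Sketch using scikit-learn; parameter values are arbitrary examples.
from sklearn.tree import DecisionTreeClassifier

# Pre-pruning: stop growing when a node has fewer than 5 examples,
# or when a split would barely reduce impurity.
pre_pruned = DecisionTreeClassifier(
    criterion="entropy",
    min_samples_leaf=5,          # roughly the t1 rule
    min_impurity_decrease=0.01,  # closest knob to the t2 rule (not identical)
)

# Post-pruning: grow the full tree, then prune subtrees via cost-complexity pruning.
post_pruned = DecisionTreeClassifier(criterion="entropy", ccp_alpha=0.02)

X = [[2.3, 4.7], [1.9, 2.8], [6.1, 4.3]]
y = ["+", "+", "-"]
post_pruned.fit(X, y)
print(post_pruned.predict([[5.0, 3.0]]))  # predicted label for a new example
```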

Gini Index: an alternative to Entropy

It can be computed as $4P_+(1-P_+)$.

Sometimes the 4 is omitted; it only rescales the index so that its maximum value (at $P_+ = 0.5$) is 1.
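The quick numerical comparison below (hypothetical helper functions) shows that the Gini index and entropy agree at the extremes: both are 0 for a pure node and maximal at $P_+ = 0.5$.

```python
import math

def entropy(p):
    """Binary entropy as a function of the fraction p of + examples."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def gini(p, scaled=True):
    """Gini index for a binary node; with scaled=False the leading 4 is omitted."""
    return (4 if scaled else 1) * p * (1 - p)

for p in (0.0, 0.25, 0.5, 1.0):
    print(f"P+={p:.2f}  entropy={entropy(p):.3f}  gini={gini(p):.3f}")
# Both measures are 0 at P+ = 0 or 1 (pure) and largest at P+ = 0.5.
```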