CS-GY 6923: Machine Learning

Decision Trees

This classifier is commonly used when the data is not linearly separable.

Decision Trees can be used with only categorical attributes, only numerical attributes, or a mix of both.

Consider the dataset below:

$$\begin{array}{c|c|c} x_1 & x_2 & \text{label}\\ 2.3 & 4.7 & +\\ 1.9 & 2.8 & +\\ 6.1 & 4.3 & - \end{array}$$

Each node of the Decision Tree for the above dataset will be of the form $x_i > t$ (for $t \in \mathbb{R}$), and will have a binary split (N on the left, Y on the right). The leaf nodes will be labels.
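To make this concrete, here is a minimal Python sketch of such a tree. The `Node` class and `predict` helper are illustrative names (not code from the course); the example tree splits on $x_1 > 4.0$, which happens to separate the three examples above.

```python
# Minimal sketch of a binary decision-tree node of the form x_i > t.
# Names here are illustrative, not part of the course material.

class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None, label=None):
        self.feature = feature      # index i of the attribute tested at this node
        self.threshold = threshold  # threshold t in the test x_i > t
        self.left = left            # "N" branch (x_i <= t)
        self.right = right          # "Y" branch (x_i > t)
        self.label = label          # class label if this node is a leaf

def predict(node, x):
    """Route example x down the tree until a leaf is reached."""
    while node.label is None:
        node = node.right if x[node.feature] > node.threshold else node.left
    return node.label

# Example tree: test x_1 > 4.0 at the root; the leaves carry the labels.
tree = Node(feature=0, threshold=4.0,
            left=Node(label="+"),    # x_1 <= 4.0  ->  +
            right=Node(label="-"))   # x_1 >  4.0  ->  -

print(predict(tree, [2.3, 4.7]))  # +
print(predict(tree, [6.1, 4.3]))  # -
```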

The goal is to generate a small tree whose leaves are pure (all examples reaching the leaf have the same label) or mostly pure (the majority of examples reaching the leaf have the same label).

Larger trees are more prone to overfitting.

Finding the smallest tree, whether in terms of depth, number of nodes, or both, is NP-hard. So, we use a greedy heuristic, which means there is no guarantee that we generate the best tree.

Some advantages of using Decision Trees:

  1. Easy to generate trees

  2. Human-interpretable, i.e., a human can visualize the tree and understand the decision being made at each node in order to verify whether something is wrong

How to determine the root node?

To determine which attribute to use at the root node, we compute the impurity of the split induced by each candidate attribute and choose the one that yields the purest partition.

Consider labels + and -. Highest purity is when $P_+ = 1$ or $P_+ = 0$, and highest impurity is when $P_+ = 0.5$,

where $P_+$ is the fraction of examples labeled +.

We use Entropy to determine purity.

$$\text{Entropy} = -P_+ \log_2(P_+) - P_- \log_2(P_-)$$
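The short Python helper below (a hypothetical sketch that simply follows this formula) shows how entropy behaves at the extremes:

```python
import math

def entropy(p_plus):
    """Binary entropy of a node, where p_plus is the fraction of + examples."""
    total = 0.0
    for p in (p_plus, 1.0 - p_plus):
        if p > 0:                     # convention: 0 * log2(0) = 0
            total -= p * math.log2(p)
    return total

print(entropy(1.0))    # 0.0    (pure node: highest purity)
print(entropy(0.5))    # 1.0    (highest impurity)
print(entropy(2 / 3))  # ~0.918 (e.g. 2 of 3 examples are +)
```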

If we pick a random example from $X$, what is the expected entropy of the node where it ends up after the split?

It can be computed as $P(\text{split decision}) \cdot \text{Entropy}(\text{split decision}) + P(\sim \text{split decision}) \cdot \text{Entropy}(\sim \text{split decision})$

For each possible decision leading to a different partition of $X$ into $X_Y$ and $X_N$, choose the decision that minimizes:

$$P[Y] \cdot \text{Entropy}(X_Y) + P[N] \cdot \text{Entropy}(X_N)$$

where $P[Y] = \frac{|X_Y|}{|X|}$ and $P[N] = \frac{|X_N|}{|X|}$,

and $X_Y = \{x \in X \mid \text{decision on } x \text{ is } Y\}$, $X_N = \{x \in X \mid \text{decision on } x \text{ is } N\}$
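The brute-force sketch below (the names `best_split` and `label_entropy` are made up for illustration) tries every threshold suggested by the data and returns the decision minimizing this weighted entropy; on the three-example dataset above it recovers a pure split on $x_1$.

```python
import math

def label_entropy(labels):
    """Entropy of a list of labels (any number of classes)."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return sum(-p * math.log2(p) for p in probs if p > 0)

def best_split(X, y):
    """Return the split x_i > t minimizing P[Y]*Entropy(X_Y) + P[N]*Entropy(X_N)."""
    n, d = len(X), len(X[0])
    best = None
    for i in range(d):                            # every attribute
        for t in sorted({row[i] for row in X}):   # every threshold seen in the data
            y_yes = [lbl for row, lbl in zip(X, y) if row[i] > t]
            y_no  = [lbl for row, lbl in zip(X, y) if row[i] <= t]
            if not y_yes or not y_no:
                continue                          # skip degenerate splits
            expected = (len(y_yes) / n) * label_entropy(y_yes) \
                     + (len(y_no) / n) * label_entropy(y_no)
            if best is None or expected < best[0]:
                best = (expected, i, t)
    return best  # (expected entropy, attribute index, threshold)

X = [[2.3, 4.7], [1.9, 2.8], [6.1, 4.3]]
y = ["+", "+", "-"]
print(best_split(X, y))  # (0.0, 0, 2.3): splitting on x_1 > 2.3 gives two pure leaves
```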

Note: If we have categorical attributes, instead of Y and N the node will have one branch per value of that attribute.

Preventing Overfitting

There are a few methods to prevent overfitting in Decision Trees:

  1. Stop Growing the Tree (Pre-Pruning): stop growing the tree before reaching purity in each leaf. Form a leaf if either:

    • the number of training examples reaching the node is less than $t_1$ (a threshold we choose), or

    • the entropy of the set of training examples reaching the node is less than $t_2$ (a threshold we choose)

  2. Post-Pruning: grow the tree to purity, then cut off subtrees and replace them with leaves. This technique needs a validation set to determine where and how much to prune.
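As one concrete way to apply these ideas, the sketch below uses scikit-learn's `DecisionTreeClassifier` (assuming scikit-learn is available). The parameter values are arbitrary examples: `min_samples_leaf` plays the role of $t_1$, `min_impurity_decrease` is only the closest available analogue of the $t_2$ rule (not an exact match), and `ccp_alpha` controls cost-complexity post-pruning, whose strength would normally be chosen on a validation set.

```python
# Sketch using scikit-learn; parameter values are arbitrary examples.
from sklearn.tree import DecisionTreeClassifier

# Pre-pruning: stop growing when a node has fewer than 5 examples,
# or when a split would barely reduce impurity.
pre_pruned = DecisionTreeClassifier(
    criterion="entropy",
    min_samples_leaf=5,          # roughly the t1 rule
    min_impurity_decrease=0.01,  # closest knob to the t2 rule (not identical)
)

# Post-pruning: grow the full tree, then prune subtrees via cost-complexity pruning.
post_pruned = DecisionTreeClassifier(criterion="entropy", ccp_alpha=0.02)

X = [[2.3, 4.7], [1.9, 2.8], [6.1, 4.3]]
y = ["+", "+", "-"]
post_pruned.fit(X, y)
print(post_pruned.predict([[5.0, 3.0]]))  # predicted label for a new example
```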

Gini Index: an alternative to Entropy

It can be computed as $4P_+(1-P_+)$.

Sometimes the 4 is omitted; it only rescales the index so that its maximum value (at $P_+ = 0.5$) is 1.
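The quick numerical comparison below (hypothetical helper functions) shows that the Gini index and entropy agree at the extremes: both are 0 for a pure node and maximal at $P_+ = 0.5$.

```python
import math

def entropy(p):
    """Binary entropy as a function of the fraction p of + examples."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def gini(p, scaled=True):
    """Gini index for a binary node; with scaled=False the leading 4 is omitted."""
    return (4 if scaled else 1) * p * (1 - p)

for p in (0.0, 0.25, 0.5, 1.0):
    print(f"P+={p:.2f}  entropy={entropy(p):.3f}  gini={gini(p):.3f}")
# Both measures are 0 at P+ = 0 or 1 (pure) and largest at P+ = 0.5.
```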