Decision Trees
This classifier is commonly used when the data is not linearly separable.
Decision Trees can be used with just categorical attributes, just numerical attributes, or a mix of both.
Consider the dataset below:
Each node of the Decision Tree for the above dataset will be of the form $x_i \geq t$? (for some attribute $x_i$ and threshold $t$), and will have a binary split (N on the left, Y on the right). The leaf nodes will be labels.
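One way to picture such a node: it either holds a label (a leaf) or a test "is $x_i \geq t$?" together with an N branch and a Y branch. Below is a minimal sketch in Python, assuming numerical attributes; the class and field names are purely illustrative.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Leaf:
    label: str             # label predicted for any example reaching this leaf

@dataclass
class Node:
    attribute: int         # index of the attribute being tested
    threshold: float       # the test is "x[attribute] >= threshold?"
    no_branch: "Tree"      # followed when the answer is N
    yes_branch: "Tree"     # followed when the answer is Y

Tree = Union[Leaf, Node]

def predict(tree: Tree, x) -> str:
    """Route an example x down the tree until a leaf's label is reached."""
    while isinstance(tree, Node):
        tree = tree.yes_branch if x[tree.attribute] >= tree.threshold else tree.no_branch
    return tree.label

# A depth-1 tree: "is attribute 0 >= 2.5?"  N -> '-',  Y -> '+'
stump = Node(attribute=0, threshold=2.5, no_branch=Leaf('-'), yes_branch=Leaf('+'))
print(predict(stump, [3.0]))   # '+'
```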
The goal is to generate a small tree with pure (all examples for that node have the same label) or mostly pure (the majority of examples for that node have the same label) leaves.
Larger trees are more prone to overfitting.
Finding the smallest tree, whether in terms of depth, number of nodes, or both, is NP-hard, so we use a greedy heuristic. Therefore, there is no guarantee that we are generating the best tree.
Some advantages of using Decision Trees:
Easy to generate trees
Human-interpretable, i.e., a human can visualize the tree and understand the decision being made at each node to verify if something is wrong
To determine which attribute to use at the root node, we compute the impurity of the split each attribute produces, and choose the attribute whose split is purest.
We use Entropy to determine purity.
If we pick a random example from X, what is the expected entropy of the leaf node where it ends up?
Note: If we have categorical attributes, then instead of Y and N, we will have one branch per value of that attribute.
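To make this concrete, here is a small sketch in Python (the function names and toy data are only illustrative) that computes the entropy of a set of labels, the expected entropy of a candidate Y/N split, and greedily picks the best threshold for a single numerical attribute; repeating this over every attribute gives the test for the root node.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (base 2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def expected_entropy(labels, mask):
    """Expected entropy after a binary split.

    mask[i] is True if example i answers Y to the test, False for N.
    This is the weighted average of the branch entropies, i.e. the expected
    entropy of the node a random example from X ends up in.
    """
    yes = [y for y, m in zip(labels, mask) if m]
    no = [y for y, m in zip(labels, mask) if not m]
    n = len(labels)
    return (len(yes) / n) * entropy(yes) + (len(no) / n) * entropy(no)

def best_numeric_split(xs, labels):
    """Greedily pick the threshold t (test: x >= t?) with lowest expected entropy."""
    best = None
    for t in sorted(set(xs)):
        mask = [x >= t for x in xs]
        score = expected_entropy(labels, mask)
        if best is None or score < best[0]:
            best = (score, t)
    return best

# Toy data: one numerical attribute and labels for 8 training examples.
xs = [1.0, 2.0, 2.5, 3.0, 4.5, 5.0, 6.0, 7.0]
labels = ['-', '-', '-', '+', '-', '+', '+', '+']
print(best_numeric_split(xs, labels))   # -> (expected entropy, best threshold)
```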
There are a few methods to prevent overfitting in Decision Trees:
Stop Growing Tree (Pre-Pruning): stop growing the tree before reaching purity in each leaf. Form a leaf if either (see the sketch after this list):
the number of training examples reaching the node is < t1 (we choose t1)
the entropy of the set of training examples reaching the node is < t2 (we choose t2)
Post-Pruning: grow the tree to purity and then cut off subtrees, replacing them with leaves. This technique needs a validation set to determine where and how to prune.
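A minimal sketch of the pre-pruning check, with illustrative default values for t1 and t2 (in practice we choose both, e.g. by tuning on a validation set); post-pruning would additionally compare validation accuracy before and after replacing a subtree with a leaf.

```python
from collections import Counter

def should_stop(labels, node_entropy, t1=5, t2=0.3):
    """Pre-pruning: form a leaf instead of splitting further when fewer than
    t1 examples reach the node, or its entropy is already below t2.
    The defaults here are arbitrary illustrative values."""
    return len(labels) < t1 or node_entropy < t2

def leaf_label(labels):
    """Label a (possibly impure) leaf with its majority class."""
    return Counter(labels).most_common(1)[0][0]

# Example: only 3 examples reach the node, 2 of them '+', entropy ~0.918.
print(should_stop(['+', '+', '-'], node_entropy=0.918))  # True (< t1 examples)
print(leaf_label(['+', '+', '-']))                       # '+'
```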
Consider labels + and -, and let $p_+$ be the fraction of examples labeled +. Highest purity is when $p_+ = 0$ or $p_+ = 1$, and highest impurity is when $p_+ = \frac{1}{2}$. Entropy can be computed as

$$H(p_+) = -p_+ \log_2 p_+ - p_- \log_2 p_-,$$

where $p_- = 1 - p_+$. Impurity can also be computed as $4\,p_+ p_-$ (a rescaled Gini index); sometimes, the 4 is omitted.

For each possible decision leading to a different partition of $X$ into $X_Y$ and $X_N$, choose the decision that minimizes

$$\frac{n_Y}{n} H(p_Y) + \frac{n_N}{n} H(p_N),$$

where $n = |X|$, $n_Y = |X_Y|$, and $n_N = |X_N|$, and $p_Y$ and $p_N$ are the fractions of + examples in $X_Y$ and $X_N$, respectively. This weighted sum is exactly the expected entropy of the node a random example from $X$ ends up in.
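As a quick worked example of these formulas (the numbers are chosen purely for illustration): a node reached by 8 training examples, 6 of them +, has entropy

$$H\!\left(\tfrac{6}{8}\right) = -\tfrac{6}{8}\log_2\tfrac{6}{8} - \tfrac{2}{8}\log_2\tfrac{2}{8} \approx 0.811.$$

A decision that splits it into $X_Y$ with 4 + / 0 - and $X_N$ with 2 + / 2 - has weighted entropy

$$\tfrac{4}{8}\cdot 0 + \tfrac{4}{8}\cdot 1 = 0.5,$$

so this decision lowers the expected entropy and would be preferred over any decision scoring higher.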