Chapter 6: Decision Trees

Notes


Training and Visualizing a Decision Tree

  • They’re a fundamental component of Random Forests
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, 2:]  # keep only petal length and petal width
y = iris.target

tree_clf = DecisionTreeClassifier(max_depth=2)
tree_clf.fit(X, y)
DecisionTreeClassifier(max_depth=2)

image
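One way a diagram like the one above can be produced is by exporting the fitted tree to Graphviz format (a sketch; the file name iris_tree.dot is just an example):

from sklearn.tree import export_graphviz

export_graphviz(
    tree_clf,
    out_file="iris_tree.dot",               # example output path
    feature_names=iris.feature_names[2:],   # petal length and width
    class_names=iris.target_names,
    rounded=True,
    filled=True,
)
# Then convert the .dot file to an image with the graphviz CLI, e.g.:
#   dot -Tpng iris_tree.dot -o iris_tree.png

Newer Scikit-learn versions also offer sklearn.tree.plot_tree, which draws the tree directly with matplotlib.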

Making Predictions

  • As the diagram shows, the tree first splits the dataset on whether or not petal length <= 2.45
    • If True, we reach a leaf node where the class is 100% setosa
    • Otherwise we continue to the next question, which further separates the remaining instances based on petal width (the learned splits can also be inspected programmatically; see the sketch after this list)
  • Decision Trees do not require feature scaling or centering at all
  • Scikit-learn uses the CART algorithm, which only produces binary trees, but other algorithms (e.g. ID3) can produce trees whose nodes have more than two children
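The split features and thresholds can be read straight from the fitted estimator’s tree_ attribute (a minimal sketch using the tree_clf trained above):

tree = tree_clf.tree_
feature_names = iris.feature_names[2:]  # petal length and width

for node_id in range(tree.node_count):
    if tree.children_left[node_id] == -1:  # -1 marks a leaf node
        print(f"node {node_id}: leaf")
    else:
        name = feature_names[tree.feature[node_id]]
        print(f"node {node_id}: split on {name} <= {tree.threshold[node_id]:.2f}")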

Gini Impurity

  • Gini Impurity is a measure of how impure a node is
    • The impurity is zero when all the training instances the node applies to belong to the same class (e.g. the left-most leaf node above)
\[G_i = 1 - \sum_{k=1}^{n} p_{i,k}^2\]
  • $p_{i,k}$ is the ratio of class k instances among the training instances in the i-th node (a worked example follows)
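As a worked example, using the class ratios that predict_proba reports later for the petal (5, 1.5) sample (roughly 0, 0.907, and 0.093), the Gini impurity of that leaf is:

\[G = 1 - 0^2 - 0.907^2 - 0.093^2 \approx 0.168\]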

Here is the decision boundary created by the tree:

image
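A minimal sketch of how a plot like this could be produced (assumes tree_clf, X, and y from above; the grid bounds and styling are arbitrary choices):

import numpy as np
import matplotlib.pyplot as plt

# Evaluate the tree on a dense grid covering the petal length/width plane
x1, x2 = np.meshgrid(np.linspace(0, 7.5, 300), np.linspace(0, 3, 300))
pred = tree_clf.predict(np.c_[x1.ravel(), x2.ravel()]).reshape(x1.shape)

plt.contourf(x1, x2, pred, alpha=0.3)               # predicted-class regions
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor="k")   # training points
plt.xlabel("petal length (cm)")
plt.ylabel("petal width (cm)")
plt.show()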

  • Decision Trees are considered a white box model (as opposed to a black box) because it’s easy to retrace why the model made a given decision, whereas with a neural network that is not as feasible

Estimating Class Probabilities

  • It assigns probabilities simply as the fraction of training instances of each class in the leaf node that the sample falls into
print(tree_clf.predict_proba([[5, 1.5]]))  # class probabilities for a petal 5 cm long and 1.5 cm wide
print(tree_clf.predict([[5, 1.5]]))        # the predicted class index
[[0.         0.90740741 0.09259259]]
[1]

The CART Training Algorithm

  • The CART algorithm works as follows:
    • At the root node, pick a feature k and a threshold t_k for that feature and split the training set into two subsets
      • Pick the feature-threshold combo that results in the purest subsets, weighted by their size (the cost function it minimizes is shown right after this list)
    • Continue recursively down the tree until some user-defined stopping condition is met (e.g. max_depth) or no additional split can be found that reduces the impurity
  • This is a greedy algorithm in the sense that it searches for the best split at the current level and repeats that at each level below, but it doesn’t consider whether a split will lead to the lowest possible impurity several levels down
  • Finding the optimal tree is an NP-Complete problem requiring $O(\exp(m))$ time, which is intractable even for small training sets; in practice the CART algorithm produces a reasonably good, though not guaranteed optimal, solution
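The cost function that CART minimizes for classification is the size-weighted impurity of the two subsets:

\[J(k, t_k) = \frac{m_{\text{left}}}{m} G_{\text{left}} + \frac{m_{\text{right}}}{m} G_{\text{right}}\]

where $G_{\text{left/right}}$ measures the impurity of the left/right subset and $m_{\text{left/right}}$ is the number of instances in it.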

Computational Complexity

  • Prediction is essentially a binary-tree traversal, so making a prediction is $O(\log(m))$
  • For training, the algorithm compares all features on all samples at each node; summed over all levels this gives a training complexity of $O(n \times m \log(m))$

Gini Impurity or Entropy?

  • Entropy is another impurity measure
  • A set’s entropy is zero when it contains instances of only one class (which is also when gini is 0)
\[H_i = - \sum\limits_{\substack{k=1 \\ p_{i,k} \neq 0}}^n p_{i,k} \log(p_{i,k})\]
  • Most of the time it doesn’t matter which metric you use. Gini impurity is slightly faster to compute, so it’s a good default. When they differ, Gini impurity tends to isolate the most frequent class in its own branch of the tree, while entropy tends to produce slightly more balanced trees.
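In Scikit-learn the impurity measure is selected with the criterion hyperparameter (a minimal sketch, reusing X and y from above):

# criterion defaults to "gini"; switch to entropy like this
tree_clf_entropy = DecisionTreeClassifier(criterion="entropy", max_depth=2)
tree_clf_entropy.fit(X, y)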

Regularization Hyperparameters

  • Decision trees left unconstrained are liable to overfit the training data
  • They are known as nonparametric models, not because they have no parameters, but because the number of parameters is not determined before training (unlike in a linear model), so the structure is free to stick closely to the data
  • To prevent overfitting, we can use regularization to restrict the degrees of freedom of the decision tree during training
  • In Scikit-learn there are several regularization hyperparameters to pick from: max_depth, min_samples_split, min_samples_leaf, min_weight_fraction_leaf, max_leaf_nodes, and max_features
  • Increasing the min_* hyperparameters and reducing the max_* ones will regularize the model

  • Other algorithms work by first training an unrestricted tree and then pruning unnecessary nodes
  • A node is deemed unnecessary if all of its children are leaf nodes and the purity improvement it provides is not statistically significant (e.g. a chi-squared test yields a p-value above some chosen threshold, such as 5%)
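A minimal sketch of regularizing a tree in practice (the moons dataset and min_samples_leaf=4 are just illustrative choices, not values from the text):

from sklearn.datasets import make_moons

Xm, ym = make_moons(n_samples=100, noise=0.25, random_state=53)

# An unconstrained tree vs. one constrained to at least 4 samples per leaf
deep_tree_clf = DecisionTreeClassifier(random_state=42)
regularized_tree_clf = DecisionTreeClassifier(min_samples_leaf=4, random_state=42)
deep_tree_clf.fit(Xm, ym)
regularized_tree_clf.fit(Xm, ym)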

image

Regression

  • For regression, CART works much as before, except that instead of minimizing impurity it picks each split so as to minimize the MSE of the resulting subsets; it keeps splitting until a stopping condition is met, and the prediction for a leaf is simply the mean target value of the training samples in that leaf (a small sketch follows)
  • I’m skeptical of this kind of regression, though: the predictions are piecewise constant and the split boundaries are only ever orthogonal to an axis, so it’s hard to see when it would beat other regressors on its own
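A minimal sketch of a regression tree (the noisy quadratic data below is a made-up example, not from the text):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Illustrative noisy quadratic dataset
rng = np.random.RandomState(42)
Xr = rng.rand(200, 1) - 0.5
yr = Xr[:, 0] ** 2 + 0.025 * rng.randn(200)

tree_reg = DecisionTreeRegressor(max_depth=2)
tree_reg.fit(Xr, yr)
print(tree_reg.predict([[0.2]]))  # the mean target value of the leaf that x = 0.2 falls into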

image

Instability

  • Decision boundaries are always orthogonal to an axis, which makes trees sensitive to rotations of the training data
  • Using PCA to rotate the data into a better orientation can mitigate this issue (a small pipeline sketch appears at the end of this section)
  • They are also very sensitive to small variations in the training data
  • Removing a single sample from the Iris dataset can result in a totally different tree; even without removing samples, retraining may yield a different tree because of the stochastic nature of Scikit-learn’s training algorithm (unless you set random_state)
  • Random Forests can limit this instability by averaging predictions over many trees
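One way the PCA idea could be wired up (a sketch; the pipeline simply rotates the features before the tree sees them, reusing X and y from above):

from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

# Rotate the feature space with PCA, then fit the tree on the rotated data
pca_tree_clf = make_pipeline(PCA(n_components=2), DecisionTreeClassifier(max_depth=2))
pca_tree_clf.fit(X, y)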