Decision Tree Optimization


How to Optimize a Decision Tree?


There are two important factors that drive the efficiency of a Decision Tree implementation:
  • Ensuring splits on the right features
  • Avoiding overfitting
The essential concept of a decision tree is splitting the data on the appropriate features at appropriate thresholds, so identifying those features and thresholds is critical. And overfitting, of course, is the dominant evil in any machine learning scenario. Researchers have identified several ways around these problems. Below is an introduction to the most important ones.

Optimize the Split

The first step in building a Decision Tree is to identify the right features and split points. Below are some of the important techniques to help you with that.

Gini Index

This works with a categorical target variable and produces only binary splits. The Gini Index is based on the concept of purity of a population: a population is pure if the probability that two random samples belong to the same class is 1. For a node, the Gini Index is the sum of squared class probabilities; for a split, it is the size-weighted average of the child nodes' Gini values. Compute the Gini Index for the different possible splits; the one with the highest Gini Index should be the starting point.
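To make this concrete, below is a minimal Python sketch of the Gini calculation for candidate binary splits. The toy data, the thresholds, and the helper names (gini, weighted_gini) are made up for illustration, assuming the Gini Index of a node is the sum of squared class probabilities and the Gini of a split is the size-weighted average over the child nodes.

```python
import numpy as np

def gini(labels):
    """Gini purity: probability that two random samples share a class."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return np.sum(p ** 2)

def weighted_gini(labels, mask):
    """Size-weighted Gini of a binary split defined by a boolean mask."""
    n = len(labels)
    left, right = labels[mask], labels[~mask]
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

# Toy example: a binary target and one candidate feature.
y = np.array([1, 1, 1, 0, 0, 1, 0, 0, 1, 1])
x = np.array([2, 3, 3, 7, 8, 1, 9, 6, 2, 4])

# Compare two candidate thresholds; the split with the higher weighted Gini wins.
for threshold in (4, 6):
    score = weighted_gini(y, x < threshold)
    print(f"split at x < {threshold}: weighted Gini = {score:.3f}")
```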

Chi-Square

This is another way of identifying the more significant feature. Chi-Square is computed from the sum of squares of the standardized differences between the observed and expected frequencies of the target variable. Specifically, the Chi-Square value of a split is calculated as
Chi-Square = sum of ((Actual Count - Expected Count)^2 / Expected Count)
where the sum runs over the target classes in each child node, and the Expected Count is the child node's size multiplied by the parent node's class proportion.
A low value of Chi-Square suggests that the actual and expected counts are similar. That means the feature hardly impacts the decision - the outcome is close to what chance alone would produce. A high difference between the actual and expected counts means that the feature has enough influence to drive the decision significantly away from its chance value. The feature with the highest Chi-Square value should be used to start the decision.
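The calculation can be sketched as below. The child-node counts and the helper name chi_square_split are hypothetical; the expected counts are derived from the parent node's class proportions, as described above.

```python
import numpy as np

def chi_square_split(child_class_counts):
    """Chi-Square of a split; rows = child nodes, columns = target classes."""
    counts = np.asarray(child_class_counts, dtype=float)
    node_sizes = counts.sum(axis=1, keepdims=True)   # samples per child node
    class_props = counts.sum(axis=0) / counts.sum()  # parent class proportions
    expected = node_sizes * class_props              # expected counts per child
    return np.sum((counts - expected) ** 2 / expected)

# Parent node with 20 positives and 20 negatives.
# Split A separates the classes well; split B barely changes anything.
split_a = [[18, 4], [2, 16]]
split_b = [[11, 9], [9, 11]]
print("Chi-Square of split A:", round(chi_square_split(split_a), 2))  # large
print("Chi-Square of split B:", round(chi_square_split(split_b), 2))  # small
```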

Information Gain

Entropy is the measure of disorder that most processes try to minimize. In terms of information theory, the higher the entropy, the lower the information. Most of our computational effort is an attempt to gain information, so we try to minimize the entropy. For a node with two classes, the entropy is computed as
entropy = - p * (log2 p) - q * (log2 q)
where p and q are the probabilities of success and failure at that node. If both are 0.5 - meaning the node separates nothing - the entropy is 1, its maximum. On the other hand, if one of them is close to 1 and the other close to 0 - meaning the node is almost pure - the entropy is close to 0. The information gain of a split is the parent node's entropy minus the size-weighted average entropy of its child nodes; the split with the highest information gain should be the first choice.
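Below is a minimal sketch of the entropy and information-gain computation for a binary split. The toy arrays, the threshold, and the helper names are illustrative only.

```python
import numpy as np

def entropy(labels):
    """entropy = -p*log2(p) - q*log2(q), generalised to any number of classes."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, mask):
    """Parent entropy minus the size-weighted entropy of the two child nodes."""
    n = len(labels)
    left, right = labels[mask], labels[~mask]
    weighted_child = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted_child

y = np.array([1, 1, 0, 0, 1, 0, 1, 0])
x = np.array([5, 6, 2, 1, 7, 3, 8, 2])

print("parent entropy:", round(entropy(y), 3))                        # 1.0, since p = q = 0.5
print("gain for split x > 4:", round(information_gain(y, x > 4), 3))  # 1.0, a perfect split
```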

Overfitting

Next, we look at the techniques for avoiding overfitting. When training a decision tree, it is possible that the tree grows so large that there is a leaf for every data element. In such a case, we get 100% accuracy on the training data and, of course, the model fails miserably outside that set. Thus, the fundamental cause of overfitting is the tree growing too big. Logically, there are two ways to prevent this: don't let the tree grow too big in the first place, or grow it fully and then cut it down. That is what the techniques below do.

Constrain the Tree Size

The tree can be constrained by placing individual limits on its growth, such as the following (a tuning sketch follows this list):
Minimum samples for a node split
  1. Defines the minimum number of samples (or observations) which are required in a node to be considered for splitting.
  2. Used to control over-fitting. Higher values prevent a model from learning relations which might be highly specific to the particular sample selected for a tree.
  3. Too high a value can lead to under-fitting; hence it should be tuned using cross-validation (CV).
Minimum samples for a terminal node (leaf)
  1. Defines the minimum samples (or observations) required in a terminal node or leaf.
  2. Used to control over-fitting similar to min_samples_split.
  3. Generally, lower values should be chosen for imbalanced class problems, because the regions in which the minority class is in the majority will be very small.
Maximum depth of tree (vertical depth)
  1. The maximum depth of a tree.
  2. Used to control over-fitting, as higher depth allows the model to learn relations very specific to a particular sample.
  3. Should be tuned using CV.
Maximum number of terminal nodes
  1. The maximum number of terminal nodes or leaves in a tree.
  2. Can be defined in place of max_depth. Since binary trees are created, a depth of ‘n’ would produce a maximum of 2^n leaves.
Maximum features to consider for split
  1. The number of features to consider while searching for the best split. These are selected randomly.
  2. As a rule of thumb, the square root of the total number of features works well, but values up to 30-40% of the total number of features are worth checking.
  3. Higher values can lead to over-fitting, but this varies from case to case.
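As a rough illustration of tuning these limits with CV, the sketch below uses scikit-learn's DecisionTreeClassifier, whose parameters (min_samples_split, min_samples_leaf, max_depth, max_leaf_nodes, max_features) correspond to the constraints listed above. The synthetic data set and the grid values are assumptions for illustration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a real data set.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

param_grid = {
    "min_samples_split": [2, 10, 50],     # minimum samples for a node split
    "min_samples_leaf": [1, 5, 20],       # minimum samples for a terminal node
    "max_depth": [3, 5, 10, None],        # maximum depth of the tree
    "max_leaf_nodes": [None, 20, 50],     # maximum number of terminal nodes
    "max_features": [None, "sqrt", 0.4],  # features considered for each split
}

search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print("best constraints:", search.best_params_)
print("cross-validated accuracy:", round(search.best_score_, 3))
```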

Tree Pruning

Constraining the tree works greedily, while training - the tree is limited at each split without seeing the full picture of the data. This may be quicker, but it is not necessarily the best approach. Pruning, on the other hand, lets the tree grow really large and then reduces it by removing low-performing leaves, bottom up. The process is therefore more consistent and usually results in a much better model.
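One concrete post-pruning approach is cost-complexity pruning, which scikit-learn exposes through the ccp_alpha parameter. The sketch below grows a full tree, computes the candidate pruning levels, and keeps the level with the best cross-validated accuracy; the data set is synthetic and purely illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Grow the full, unconstrained tree and get the effective alphas along the pruning path.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
alphas = path.ccp_alphas[:-1]  # the last alpha prunes the tree down to a single node

# Score each pruned tree with cross-validation and keep the best alpha.
scores = [cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a),
                          X, y, cv=5).mean()
          for a in alphas]
best_alpha = alphas[int(np.argmax(scores))]
print("best ccp_alpha:", best_alpha)
print("cross-validated accuracy:", round(max(scores), 3))
```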

Bias & Variance

Imperfection manifests in any model as Bias and Variance. These are two metrics that show how good or bad your model is. In simple terms, Bias tells us whether the predictions are systematically offset from what they are meant to be. Variance, on the other hand, tells us how much the predictions scatter around their own average. It is possible that the average of the predictions matches what it is supposed to be, but that is not enough if the individual points do not match - that is high variance. Conversely, it is possible that none of the points match but all of them are shifted by a constant amount - that is high bias with low variance. A small tree has high bias and low variance; a big tree tends to produce low bias but high variance. As the tree grows, the total error first falls and then rises again, and naturally we would like to find the minimum of this U-shaped error curve.
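One way to see this U-shaped behaviour is to compare training and cross-validated accuracy as the tree is allowed to grow deeper. The sketch below uses scikit-learn's validation_curve on a synthetic, slightly noisy data set; all settings are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise so that very deep trees overfit.
X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.1, random_state=1)
depths = np.arange(1, 16)

train_scores, cv_scores = validation_curve(
    DecisionTreeClassifier(random_state=1), X, y,
    param_name="max_depth", param_range=depths, cv=5)

for d, tr, cv in zip(depths, train_scores.mean(axis=1), cv_scores.mean(axis=1)):
    print(f"max_depth {d:2d}: train accuracy {tr:.3f}, CV accuracy {cv:.3f}")
# Typically, shallow trees underfit (high bias), very deep trees overfit
# (high variance), and the best CV accuracy sits somewhere in between.
```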
Bagging is one of the techniques used to reduce the variance of our predictions. Essentially, it combines multiple classifier models trained on different sub-samples of the given data set. The training data set is sampled multiple times to get multiple subsets; each subset is used to train a model independently, and the models are then merged to get the final model. Thus, the steps for bagging are:

Create Multiple DataSets:

  1. Sampling is done with replacement on the original data and new datasets are formed.
  2. The new data sets can have a fraction of the columns as well as the rows; these fractions are generally hyper-parameters in a bagging model.
  3. Taking row and column fractions less than 1 helps in making robust models that are less prone to overfitting.

Build Multiple Classifiers:

  1. Classifiers are built on each data set.
  2. Generally the same classifier is modeled on each data set and predictions are made.

Combine Classifiers:

  1. The predictions of all the classifiers are combined using a mean, median or mode value depending on the problem at hand.
  2. The combined values are generally more robust than a single model.
There are many different implementations of this concept of Bagging; Random Forest is one of them.
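As a rough sketch of these three steps, scikit-learn's BaggingClassifier samples rows (and optionally columns) of the data, fits one model per sample (its default base estimator is a decision tree), and combines the predictions by voting. The data set and parameter values below are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=7)

bagger = BaggingClassifier(   # default base estimator is a decision tree
    n_estimators=100,         # number of bootstrapped data sets / models
    max_samples=0.8,          # fraction of rows in each data set
    max_features=0.8,         # fraction of columns in each data set
    bootstrap=True,           # sample rows with replacement
    random_state=7,
)

single_tree = cross_val_score(DecisionTreeClassifier(random_state=7), X, y, cv=5).mean()
bagged = cross_val_score(bagger, X, y, cv=5).mean()
print("single tree CV accuracy:", round(single_tree, 3))
print("bagged trees CV accuracy:", round(bagged, 3))
```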