Random Forest with Scikit-Learn


What is Random Forest?


A forest is essentially a collection of trees, and in machine learning, too, a "Random Forest" is an ensemble of "Decision Trees". The Random Forest algorithm fixes many of the issues we noticed with a single Decision Tree.
Essentially, Random Forest builds several Decision Tree (CART) models using different subsets of variables and data samples. This removes the effort of identifying initial variables and partitioning the data sets by hand. Each of these Decision Trees "votes" for an output, and the final result is a function of these votes. It could be a simple mean or majority, a weighted mean based on the individual trees, or any other mapping chosen by the algorithm.
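To make the voting idea concrete, here is a minimal sketch (not the library's internals) that trains a few Decision Trees on bootstrap samples of the iris data and takes a plain majority vote. The choices of 5 trees, 'sqrt' features per split and majority voting are assumptions for illustration only.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.RandomState(0)
trees = []
for _ in range(5):
    idx = rng.randint(0, len(X), len(X))  # bootstrap sample drawn with replacement
    trees.append(DecisionTreeClassifier(max_features='sqrt').fit(X[idx], y[idx]))
votes = np.array([t.predict(X) for t in trees])  # one row of class votes per tree
majority = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)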

Tuning

The efficiency and performance of the Random Forest algorithm depend on the right choice of a few important parameters.

Max Features

The maximum number of features a given Decision Tree is allowed to consider. At first sight it may seem odd to limit the number of features used in a tree; one may feel that letting every tree see every feature would make each tree more robust and complete. But if we look at the forest as a whole, we see another picture. We want to strengthen the forest, not the individual trees. If every tree used all the features, we would lose the diversity that makes the ensemble strong. On the other hand, having very few features is not good either: a single feature per tree is essentially a single Decision Tree with each individual tree acting as one node. We want something between the two extremes, so that each tree can meaningfully contribute to the final decision.
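In scikit-learn this is controlled by the max_features argument of RandomForestClassifier. A brief sketch of two common ways to set it (the best value depends on the data set):
from sklearn.ensemble import RandomForestClassifier

# 'sqrt' considers sqrt(n_features) candidate features at each split; a float is a fraction.
rfc_sqrt = RandomForestClassifier(max_features='sqrt')
rfc_half = RandomForestClassifier(max_features=0.5)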

Number of Estimators

This is the number of trees built before the forest starts taking votes and averages of predictions. The higher this value, the better the predictive performance of the forest tends to be, but training and prediction also become slower. Typically this value should be as high as the available processing time allows, and even then we should be ready for less than 100% accuracy.
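In scikit-learn this maps to the n_estimators argument. The values below are only illustrative; the right number is a trade-off between accuracy and compute time.
from sklearn.ensemble import RandomForestClassifier

small_forest = RandomForestClassifier(n_estimators=10)              # fast, noisier vote
large_forest = RandomForestClassifier(n_estimators=500, n_jobs=-1)  # slower, steadier; uses all CPU cores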

Minimum Sample Leaf Size

This parameter is inherited from Decision Trees. The lower the sample leaf size, the more noise is captured in training, which can lead to overfitting; a very high value, on the other hand, leads to underfitting. There are various rules of thumb for this parameter, but each comes with the disclaimer that one should try out multiple values to find the best one.
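In scikit-learn this is the min_samples_leaf argument; the values below are just starting points to experiment with, not recommendations.
from sklearn.ensemble import RandomForestClassifier

noisy_fit = RandomForestClassifier(min_samples_leaf=1)     # the default; can overfit
smoother_fit = RandomForestClassifier(min_samples_leaf=5)  # coarser leaves, less noise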

Implementations

Scikit-Learn provides a simple implementation of the Random Forest. To implement it in Python, we start by importing the libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
Next we load the iris data set
iris = load_iris()
For training and testing the model, we need to divide this into the train and test sets.
X_train, X_test, Y_train, Y_test = train_test_split(iris.data, iris.target, stratify=iris.target, random_state=50)
Next we instantiate an object of the RandomForestClassifier and train it with the train set.
rfc = RandomForestClassifier()
rfc.fit(X_train, Y_train)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
Now that we have trained the model, we can check its performance
rfc.score(X_train, Y_train)
rfc.score(X_test, Y_test)
1.0
0.97368421052631582
This looks like mild overfitting (a perfect score on the training set but a lower score on the test set), perhaps because min_samples_leaf is 1. We can play around with these values to get better results. This was a basic implementation of the Random Forest to give an idea; we can learn more as we tweak the parameters.
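One systematic way to explore these parameters is a cross-validated grid search. Here is a sketch using the X_train and Y_train from above; the parameter grid itself is only an example, not a recommendation.
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Try a small grid of the parameters discussed earlier and keep the best combination.
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_features': ['sqrt', 'log2'],
    'min_samples_leaf': [1, 3, 5],
}
search = GridSearchCV(RandomForestClassifier(random_state=50), param_grid, cv=5)
search.fit(X_train, Y_train)
print(search.best_params_, search.best_score_)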