Bagging Ensemble: Random Forest

The bagging (bootstrap aggregating) ensemble method builds multiple models, generally of the same type, from different samples of the training dataset drawn with replacement. The predictions of all the sub-models are then combined, typically by averaging for regression or by voting for classification.
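
To make the two steps concrete, here is a minimal sketch in plain Python of drawing bootstrap samples and combining class predictions by majority vote. The helper names (bootstrap_sample, majority_vote), the ten stand-in rows, and the five example predictions are assumptions made for illustration only. Majority voting is one common way to combine classifiers, not necessarily what a given library does internally.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows uniformly at random, with replacement."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Return the most common label among the sub-model predictions."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(7)
rows = list(range(10))  # hypothetical stand-in for ten training rows

# each sub-model would be trained on its own bootstrap sample;
# some rows repeat within a sample and others are left out
samples = [bootstrap_sample(rows, rng) for _ in range(3)]
for sample in samples:
    print(sorted(sample))

# combine hypothetical class predictions from five sub-models for one row
print(majority_vote([1, 0, 1, 1, 0]))  # → 1
```

Because sampling is done with replacement, each sample has the same size as the training set but a different mix of rows, which is what makes the sub-models differ from one another.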

Random Forest is an extension of bagged decision trees. Besides training each tree on a bootstrap sample, it considers only a random subset of the features at each split, which reduces the correlation between the trees. The trees' predictions are then combined into a single answer.
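
The per-split feature sampling can be sketched in a few lines of plain Python. This is only an illustration of the idea, not scikit-learn's implementation; the helper name candidate_features is hypothetical, and the values 8 and 3 mirror the dataset's feature count and the max_features setting used in this recipe.

```python
import random


def candidate_features(n_features, max_features, rng):
    """Pick the feature indices that a single split is allowed to consider."""
    return sorted(rng.sample(range(n_features), max_features))


rng = random.Random(7)
n_features = 8     # the Pima Indians dataset has 8 feature columns
max_features = 3   # same value this recipe passes to RandomForestClassifier

# every split draws a fresh subset, so different trees end up
# splitting on different features
for split in range(4):
    subset = candidate_features(n_features, max_features, rng)
    print("split", split, "considers features", subset)
```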


This recipe includes the following topics:

  • Load classification problem dataset (Pima Indians) from GitHub
  • Split columns into the usual feature columns (X) and target column (Y)
  • Split data using KFold() with k-fold count: 10, seed: 7
  • Instantiate the bagging ensemble method: RandomForestClassifier with num_trees: 100 and max_features: 3
  • Call cross_val_score() to run cross validation
  • Calculate mean estimated accuracy from scores returned by cross_val_score()


# import modules
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

# read data file from github
# dataframe: pimaDf
gitFileURL = 'https://raw.githubusercontent.com/andrewgurung/data-repository/master/pima-indians-diabetes.data.csv'
cols = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
pimaDf = pd.read_csv(gitFileURL, names = cols)

# convert into numpy array for scikit-learn
pimaArr = pimaDf.values

# Let's split columns into the usual feature columns(X) and target column(Y)
# Y represents the target 'class' column whose value is either '0' or '1'
X = pimaArr[:, 0:8]
Y = pimaArr[:, 8]

# set k-fold count
folds = 10

# set seed to reproduce the same random data each time
seed = 7

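# split data using KFold
# shuffle=True is needed for random_state to have any effect;
# recent scikit-learn versions raise an error if it is left out
kfold = KFold(n_splits=folds, shuffle=True, random_state=seed)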

# set total number of trees
num_trees = 100

# set the number of randomly chosen features considered at each split
max_features = 3

# instantiate the bagging ensemble method: RandomForestClassifier
model = RandomForestClassifier(n_estimators=num_trees, max_features=max_features)

# call cross_val_score() to run cross validation
resultArr = cross_val_score(model, X, Y, cv=kfold)

# calculate mean of scores for all folds
meanAccuracy = resultArr.mean()

# display mean estimated accuracy
print("Mean estimated accuracy: %.5f" % meanAccuracy)
# sample output (the exact value varies from run to run because the forest itself is not seeded)
Mean estimated accuracy: 0.76950
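
To show what the k-fold split does with the data, here is a rough plain-Python sketch of partitioning row indices into 10 folds. It is not scikit-learn's code; the function name kfold_indices is hypothetical, and the shuffle it performs will not reproduce the exact folds KFold generates. The row count of 768 is the size of the Pima Indians dataset.

```python
import random


def kfold_indices(n_samples, n_splits, seed=None):
    """Yield (train_idx, test_idx) pairs; shuffle first when a seed is given."""
    indices = list(range(n_samples))
    if seed is not None:
        random.Random(seed).shuffle(indices)
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # the first `extra` folds get one more row so every row is used once
        size = base + (1 if fold < extra else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(kfold_indices(n_samples=768, n_splits=10, seed=7))
print([len(test) for _, test in folds])
# → [77, 77, 77, 77, 77, 77, 77, 77, 76, 76]
```

Each row lands in exactly one test fold, so cross_val_score() trains and scores the model 10 times, and the mean of those 10 scores is the estimate printed above.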
