Bagging Ensemble: Bagged Decision Trees

The bagging (bootstrap aggregation) ensemble method builds multiple models (generally of the same type) from different samples of the training dataset, each drawn with replacement. The predictions from all the sub-models are then combined, typically by averaging or by majority vote.
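
The idea can be sketched in plain Python. This is a minimal illustration, not how scikit-learn implements it: the toy one-feature dataset, the `fit_stump` weak learner, and all function names are hypothetical, and predictions are combined by simple majority vote.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows at random, with replacement."""
    return [rng.choice(rows) for _ in rows]

def fit_stump(rows):
    """Hypothetical weak learner: threshold halfway between the class means."""
    xs0 = [x for x, y in rows if y == 0]
    xs1 = [x for x, y in rows if y == 1]
    if not xs0 or not xs1:
        # A bootstrap sample can miss a class entirely; predict the one seen.
        constant = 1 if xs1 else 0
        return lambda x: constant
    mean0 = sum(xs0) / len(xs0)
    mean1 = sum(xs1) / len(xs1)
    threshold = (mean0 + mean1) / 2
    high_label = 1 if mean1 >= mean0 else 0
    return lambda x: high_label if x > threshold else 1 - high_label

def fit_bagging(rows, n_models, seed):
    """Train one sub-model per bootstrap sample."""
    rng = random.Random(seed)
    return [fit_stump(bootstrap_sample(rows, rng)) for _ in range(n_models)]

def predict_bagging(models, x):
    """Combine the sub-model predictions by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Toy data: (feature, label) pairs, invented for illustration only.
train = [(0.5, 0), (1.0, 0), (1.5, 0), (2.0, 0),
         (3.0, 1), (3.5, 1), (4.0, 1), (4.5, 1)]
ensemble = fit_bagging(train, n_models=25, seed=7)
print(predict_bagging(ensemble, 0.8), predict_bagging(ensemble, 4.2))
```

An odd number of sub-models is used so that a two-class vote cannot tie.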

Bagged decision trees are built here using BaggingClassifier with the Classification and Regression Trees (CART) algorithm (DecisionTreeClassifier) as the base model. A total of 100 trees are created.


This recipe includes the following topics:

  • Load classification problem dataset (Pima Indians) from github
  • Split columns into the usual feature columns (X) and target column (Y)
  • Split data using KFold() with k-fold count: 10 (unshuffled, so no seed is needed)
  • Instantiate the CART algorithm: DecisionTreeClassifier
  • Instantiate the bagging ensemble method: BaggingClassifier with CART, num_trees:100, and seed:7
  • Call cross_val_score() to run cross validation
  • Calculate mean estimated accuracy from scores returned by cross_val_score()
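
To make the k-fold step concrete, here is a rough sketch of how unshuffled folds can be carved out of the row indices. The `kfold_indices` helper is hypothetical, and the row count of 768 is an assumption based on the commonly cited size of the Pima Indians dataset; in the recipe itself KFold() does this work.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) pairs of contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds take one additional row each.
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

# Assumed dataset size: 768 rows, split into 10 folds.
splits = list(kfold_indices(n_samples=768, n_splits=10))
fold_sizes = [len(test_idx) for _, test_idx in splits]
print(fold_sizes)
```

Each row lands in exactly one test fold, so every sample is used for evaluation once and for training in the other nine rounds.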


# import modules
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

# read data file from github
# dataframe: pimaDf
gitFileURL = 'https://raw.githubusercontent.com/andrewgurung/data-repository/master/pima-indians-diabetes.data.csv'
cols = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
pimaDf = pd.read_csv(gitFileURL, names = cols)

# convert into numpy array for scikit-learn
pimaArr = pimaDf.values

# Let's split columns into the usual feature columns(X) and target column(Y)
# Y represents the target 'class' column whose value is either '0' or '1'
X = pimaArr[:, 0:8]
Y = pimaArr[:, 8]

# set k-fold count
folds = 10

# set seed to reproduce the same random data each time
seed = 7

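# split data using KFold
# note: random_state only applies together with shuffle=True; recent
# scikit-learn versions raise an error otherwise, so the folds stay unshuffled
kfold = KFold(n_splits=folds)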

# instantiate the CART algorithm
cart = DecisionTreeClassifier()

# set total number of trees
num_trees = 100

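# instantiate the bagging ensemble method: BaggingClassifier
# with CART, num_trees:100, and seed:7
# the base estimator is passed positionally because its keyword was renamed
# from base_estimator to estimator in scikit-learn 1.2
model = BaggingClassifier(cart, n_estimators=num_trees, random_state=seed)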

# call cross_val_score() to run cross validation
resultArr = cross_val_score(model, X, Y, cv=kfold)

# calculate mean of scores for all folds
meanAccuracy = resultArr.mean()

# display mean estimated accuracy
print("Mean estimated accuracy: %.5f" % meanAccuracy)

Output:
Mean estimated accuracy: 0.77075
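
A single mean hides how much the per-fold scores vary, so it can help to report the spread as well. A small sketch under stated assumptions: the `fold_scores` values below are hypothetical, invented for illustration only, and are not the scores produced by this recipe.

```python
from statistics import mean, stdev

# Hypothetical per-fold accuracies (one value per fold), not real results.
fold_scores = [0.70, 0.80, 0.75, 0.85, 0.75, 0.80, 0.70, 0.75, 0.80, 0.80]

mean_accuracy = mean(fold_scores)
spread = stdev(fold_scores)  # sample standard deviation across folds

print("Mean estimated accuracy: %.5f (+/- %.5f)" % (mean_accuracy, spread))
```

With the array returned by cross_val_score(), the same summary is available through its mean() and std() methods; note that NumPy's std() defaults to the population standard deviation, so the two figures differ slightly.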
