Boosting Ensemble: AdaBoost

The boosting ensemble method builds multiple models (generally of the same type) in sequence, where each model learns to correct the prediction errors of the model before it.

AdaBoost (short for Adaptive Boosting) uses information gathered at each stage of the algorithm about the relative ‘hardness’ of each training sample, so that later trees tend to focus on the harder-to-classify examples.
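
To make the idea of sample ‘hardness’ concrete, here is a minimal sketch of a single re-weighting round in the textbook discrete AdaBoost formulation, using only the standard library. The function name adaboost_reweight and the toy labels are hypothetical, labels are assumed to be -1 or +1, and scikit-learn's AdaBoostClassifier may use a different variant of the update internally.

```python
import math

def adaboost_reweight(weights, y_true, y_pred):
    """One round of the textbook discrete AdaBoost update (labels are -1 or +1)."""
    # weighted error of the current weak learner
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    # the learner's vote: larger when its weighted error is smaller
    alpha = 0.5 * math.log((1.0 - err) / err)
    # t * p is +1 for a correct prediction and -1 for a wrong one, so
    # misclassified samples are scaled up and correct ones scaled down
    updated = [w * math.exp(-alpha * t * p)
               for w, t, p in zip(weights, y_true, y_pred)]
    # renormalize so the weights sum to 1 again
    total = sum(updated)
    return alpha, [w / total for w in updated]

# toy data (hypothetical): five samples, the second one is misclassified
y_true = [1, 1, -1, -1, 1]
y_pred = [1, -1, -1, -1, 1]
weights = [1.0 / len(y_true)] * len(y_true)

alpha, new_weights = adaboost_reweight(weights, y_true, y_pred)
print("alpha: %.4f" % alpha)
print("new weights:", [round(w, 3) for w in new_weights])
```

In this formulation the misclassified samples carry half of the total weight after the update, which is why the next learner in the sequence is pushed toward the examples the previous one got wrong.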

This recipe includes the following topics:

  • Load classification problem dataset (Pima Indians) from GitHub
  • Split columns into the usual feature columns (X) and target column (Y)
  • Split data using KFold() with k-fold count: 10, seed: 7
  • Instantiate the boosting ensemble method: AdaBoostClassifier with num_trees: 30, and seed: 7
  • Call cross_val_score() to run cross validation
  • Calculate mean estimated accuracy from scores returned by cross_val_score()

# import modules
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import AdaBoostClassifier

# read data file from GitHub
# dataframe: pimaDf
gitFileURL = ''
cols = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
pimaDf = pd.read_csv(gitFileURL, names = cols)

# convert into numpy array for scikit-learn
pimaArr = pimaDf.values

# Let's split columns into the usual feature columns (X) and target column (Y)
# Y represents the target 'class' column whose value is either '0' or '1'
X = pimaArr[:, 0:8]
Y = pimaArr[:, 8]

# set k-fold count
folds = 10

# set seed to reproduce the same random data each time
seed = 7

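# split data using KFold
# shuffle=True is needed because recent scikit-learn versions raise an error
# when random_state is set without shuffling
kfold = KFold(n_splits=folds, shuffle=True, random_state=seed)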

# set total number of trees
num_trees = 30

# instantiate the boosting ensemble method: AdaBoostClassifier
model = AdaBoostClassifier(n_estimators=num_trees, random_state=seed)

# call cross_val_score() to run cross validation
resultArr = cross_val_score(model, X, Y, cv=kfold)

# calculate mean of scores for all folds
meanAccuracy = resultArr.mean()

# display mean estimated accuracy
print("Mean estimated accuracy: %.5f" % meanAccuracy)
Mean estimated accuracy: 0.76046
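
To illustrate what the k-fold split in the recipe above does, here is a rough standard-library sketch of how a dataset can be partitioned into folds. The helper kfold_indices and the sample count of 25 are hypothetical and chosen only for illustration; this is a simplified stand-in for the idea behind KFold, not scikit-learn's implementation.

```python
import random

def kfold_indices(n_samples, n_splits, seed=None):
    """Yield (train_idx, test_idx) pairs; a simplified stand-in for KFold."""
    indices = list(range(n_samples))
    # shuffle only when a seed is given, so the split is reproducible
    if seed is not None:
        random.Random(seed).shuffle(indices)
    # spread the remainder over the first folds so sizes differ by at most one
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

# 25 samples split into 10 folds with seed 7 (sample count is illustrative)
folds = list(kfold_indices(n_samples=25, n_splits=10, seed=7))
fold_sizes = [len(test) for _, test in folds]
print("test fold sizes:", fold_sizes)
```

Each sample lands in exactly one test fold, so every row is used for evaluation once and for training in the remaining folds. The mean of the per-fold scores is what the recipe reports as the estimated accuracy.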

