Boosting Ensemble: Stochastic Gradient Boosting

A boosting ensemble method builds multiple models (generally of the same type) in sequence, where each model learns to correct the prediction errors of the model before it.

Stochastic Gradient Boosting is a sophisticated but often highly effective ensemble technique for improving predictive performance. In scikit-learn, a gradient boosting model for classification is constructed using the GradientBoostingClassifier class.
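To make the idea of "each model fixes the errors of the previous one" concrete, here is a minimal sketch in plain Python. It is not how scikit-learn implements gradient boosting; it assumes a toy 1-D regression problem with squared error, uses one-split "stumps" as the weak learners, and fits each stump on a random subsample of rows (the "stochastic" part). The names fit_stump, boost, and mse, and all parameter values, are hypothetical and chosen only for illustration.

```python
import random

def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the residuals (least squares)."""
    best = None
    for t in sorted(set(xs))[:-1]:  # candidate thresholds between distinct x values
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    if best is None:  # all x values identical: fall back to a constant
        const = sum(residuals) / len(residuals)
        return lambda x: const
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean

def boost(xs, ys, n_rounds=50, learning_rate=0.1, subsample=0.8, seed=7):
    """Sequentially fit stumps to the current residuals on random row subsamples."""
    rng = random.Random(seed)
    base = sum(ys) / len(ys)           # initial prediction: the mean target
    preds = [base] * len(ys)
    stumps = []
    n_sub = max(2, int(subsample * len(xs)))
    for _ in range(n_rounds):
        # residuals are the errors left over by the models built so far
        residuals = [y - p for y, p in zip(ys, preds)]
        idx = rng.sample(range(len(xs)), n_sub)   # stochastic row subsample
        stump = fit_stump([xs[i] for i in idx], [residuals[i] for i in idx])
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mse(predict, xs, ys):
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)

# toy data (made up for this sketch): y = x^2 on 20 points
xs = [i / 10 for i in range(20)]
ys = [x * x for x in xs]
mean_y = sum(ys) / len(ys)

baseline_mse = mse(lambda x: mean_y, xs, ys)
model = boost(xs, ys)
boosted_mse = mse(model, xs, ys)
print("baseline MSE: %.4f, boosted MSE: %.4f" % (baseline_mse, boosted_mse))
```

The boosted model should fit the training points noticeably better than the constant baseline, because every round adds a small correction aimed at whatever error remains.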


This recipe includes the following topics:

  • Load classification problem dataset (Pima Indians) from GitHub
  • Split columns into the usual feature columns (X) and target column (Y)
  • Split data using KFold() with k-fold count: 10
  • Instantiate the boosting ensemble method: GradientBoostingClassifier with num_trees: 100 and seed: 7
  • Call cross_val_score() to run cross validation
  • Calculate mean estimated accuracy from scores returned by cross_val_score()
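As a rough illustration of what the k-fold step does, the sketch below splits row indices into consecutive, unshuffled folds in plain Python. The function name kfold_indices is hypothetical, and the sample count of 768 is the row count commonly reported for the Pima Indians dataset, used here as an assumption rather than read from the file.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train, test) index lists for consecutive, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # the first `extra` folds get one additional sample
        size = base + 1 if fold < extra else base
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

# assumed dataset size: 768 rows, split into 10 folds
folds = list(kfold_indices(768, 10))
fold_sizes = [len(test) for _, test in folds]
print(fold_sizes)
```

Each row lands in exactly one test fold, so every sample is used for evaluation once and for training in the remaining folds; the mean of the per-fold scores is the estimated accuracy.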


# import modules
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingClassifier

# read data file from github
# dataframe: pimaDf
gitFileURL = 'https://raw.githubusercontent.com/andrewgurung/data-repository/master/pima-indians-diabetes.data.csv'
cols = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
pimaDf = pd.read_csv(gitFileURL, names = cols)

# convert into numpy array for scikit-learn
pimaArr = pimaDf.values

# Let's split columns into the usual feature columns(X) and target column(Y)
# Y represents the target 'class' column whose value is either '0' or '1'
X = pimaArr[:, 0:8]
Y = pimaArr[:, 8]

# set k-fold count
folds = 10

# set seed to reproduce the same random data each time
seed = 7

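# split data using KFold
# without shuffling the folds are deterministic, so random_state is not passed
# (recent scikit-learn versions raise an error if it is set while shuffle=False)
kfold = KFold(n_splits=folds)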

# set total number of trees
num_trees = 100

# instantiate the boosting ensemble method: GradientBoostingClassifier
model = GradientBoostingClassifier(n_estimators=num_trees, random_state=seed)

# call cross_val_score() to run cross validation
resultArr = cross_val_score(model, X, Y, cv=kfold)

# calculate mean of scores for all folds
meanAccuracy = resultArr.mean()

# display mean estimated accuracy
print("Mean estimated accuracy: %.5f" % meanAccuracy)

Output:
Mean estimated accuracy: 0.76690
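With its default settings, GradientBoostingClassifier fits every tree on all training rows (subsample=1.0). The "stochastic" variant fits each tree on a random fraction of the rows, which scikit-learn exposes through the subsample parameter. The sketch below shows one way to try it; the synthetic data from make_classification, the 0.8 fraction, the 5 shuffled folds, and the 50 trees are all assumptions made for illustration, not settings from the recipe above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold, cross_val_score

# synthetic stand-in data (assumption): 300 rows, 8 features, binary target
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=7)

# shuffled folds, so random_state is meaningful here
cv = KFold(n_splits=5, shuffle=True, random_state=7)

# subsample < 1.0 makes each tree see a random 80% of the training rows
stochastic_model = GradientBoostingClassifier(
    n_estimators=50, subsample=0.8, random_state=7
)

scores = cross_val_score(stochastic_model, X_demo, y_demo, cv=cv)
print("Mean accuracy: %.3f (std %.3f)" % (scores.mean(), scores.std()))
```

Reporting the standard deviation alongside the mean gives a sense of how much the accuracy varies from fold to fold.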
