Cross-validation: Random Test-Train Splits

Random test-train splits shuffle the dataset and divide it into a training set and a test set, then repeat this a fixed number of times. Each repetition is an independent random split of the same data.

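Note: Because each split is drawn independently, random splits do not guarantee that all folds will be different. The same sample can land in the test set of more than one split.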
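To make the note above concrete, here is a minimal sketch on a toy array. The names X_toy and splitter are made up for illustration and are not part of the recipe. With 5 splits of 3 test samples each drawn from only 10 samples, some samples must appear in more than one test set.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import ShuffleSplit

# toy dataset of 10 samples with 2 features (hypothetical, for illustration only)
X_toy = np.arange(20).reshape(10, 2)

# 5 independent random splits, roughly 30% of the samples held out each time
splitter = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)

test_sets = []
for train_idx, test_idx in splitter.split(X_toy):
    # within a single split, train and test indices never overlap
    assert set(train_idx).isdisjoint(test_idx)
    test_sets.append(set(test_idx))
    print("train:", train_idx, "test:", test_idx)

# count how often each sample index was used for testing across all splits
counts = Counter(i for s in test_sets for i in s)
repeated = sorted(i for i, c in counts.items() if c > 1)
print("samples tested more than once:", repeated)
```

This differs from KFold, where every sample is used for testing exactly once.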

Cross-validation can be run with the cross_val_score() helper function, which takes the estimator (LogisticRegression), the dataset, and the splitting strategy (ShuffleSplit).

  • cross_val_score() returns scores of the estimator for each fold
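As a quick illustration of the return value, the sketch below runs cross_val_score() on synthetic data from make_classification() rather than the Pima dataset. The names X_demo, y_demo and cv are assumptions made for this example, and max_iter=1000 is chosen only to keep the solver from stopping early.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# synthetic binary classification data (stand-in for a real dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=7)

# 10 random splits, 33% of the samples held out for testing in each
cv = ShuffleSplit(n_splits=10, test_size=0.33, random_state=7)

# one score per split; for a classifier the default score is accuracy
scores = cross_val_score(LogisticRegression(max_iter=1000), X_demo, y_demo, cv=cv)

print(scores.shape)  # one entry per split
print(scores)
```

The result is a NumPy array, so mean() and std() can be called on it directly.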

This recipe includes the following topics:

  • Load the data file from GitHub
  • Split columns into the usual feature columns (X) and target column (Y)
  • Set test size to 33%
  • Set seed to reproduce the same random data each time
  • Set the number of random splits to perform: 10
  • Split data using ShuffleSplit() class
  • Instantiate a classification model (LogisticRegression)
  • Call cross_val_score() to run cross validation
  • Calculate mean and standard deviation from scores returned by cross_val_score()


# import modules
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score

# read data file from github
# dataframe: pimaDf
gitFileURL = 'https://raw.githubusercontent.com/andrewgurung/data-repository/master/pima-indians-diabetes.data.csv'
cols = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
pimaDf = pd.read_csv(gitFileURL, names = cols)

# convert into numpy array for scikit-learn
pimaArr = pimaDf.values

# Let's split columns into the usual feature columns(X) and target column(Y)
# Y represents the target 'class' column whose value is either '0' or '1'
X = pimaArr[:, 0:8]
Y = pimaArr[:, 8]

# set test size to 33%
test_size = 0.33

# set seed to reproduce the same random data each time
seed = 7

# set the number of random splits to perform
n_splits = 10

# split data using ShuffleSplit
shufflesplit = ShuffleSplit(n_splits=n_splits, test_size=test_size, random_state=seed)

# instantiate a classification model
model = LogisticRegression()

# call cross_val_score() to run cross validation
resultArr = cross_val_score(model, X, Y, cv=shufflesplit)

# calculate mean of scores for all folds
meanAccuracy = resultArr.mean() * 100

# calculate standard deviation of scores for all folds
stdAccuracy = resultArr.std() * 100

# display accuracy
print("Mean accuracy: %.3f%%, Standard deviation: %.3f%%" % (meanAccuracy, stdAccuracy))
Mean accuracy: 76.496%, Standard deviation: 1.698%
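
To show roughly what cross_val_score() does with a ShuffleSplit object, here is a sketch that repeats the procedure by hand: iterate over the splits, fit a fresh copy of the model on the training indices, and score it on the test indices. It uses synthetic data, and the names X_demo, y_demo, demo_model and manual_scores are made up for this example. This is a simplified picture, not the library's actual implementation.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# synthetic data (assumption: stands in for the Pima dataset used in the recipe)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=7)

cv = ShuffleSplit(n_splits=10, test_size=0.33, random_state=7)
demo_model = LogisticRegression(max_iter=1000)

manual_scores = []
for train_idx, test_idx in cv.split(X_demo):
    # fit an unfitted copy so no state leaks from one split to the next
    fold_model = clone(demo_model)
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    # score() returns mean accuracy for classifiers
    manual_scores.append(fold_model.score(X_demo[test_idx], y_demo[test_idx]))
manual_scores = np.array(manual_scores)

# the helper function should produce the same per-split scores
helper_scores = cross_val_score(demo_model, X_demo, y_demo, cv=cv)
print(np.allclose(manual_scores, helper_scores))
```

Because random_state is a fixed integer, each call to cv.split() yields the same sequence of splits, which is why the manual loop and the helper can be compared directly.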
