Pipeline: Feature Selection and Modeling

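Feature selection followed by modeling is a standard workflow in machine learning. Scikit-learn provides a Pipeline utility to automate this workflow. Because every step of a Pipeline is refit on the training portion of each cross-validation fold, it helps prevent data leakage. The Pipeline itself behaves like a single estimator: it can be fitted, scored, and passed to cross_val_score() like any other model.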

Data leakage: Running feature selection on the entire dataset, including the rows later used for testing, lets information from the test data influence which features are chosen and makes the estimated accuracy overly optimistic. Feature selection should instead be fitted only on the training portion of each cross-validation fold during model evaluation.
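The effect can be illustrated with a minimal sketch on pure noise, where no feature carries real information about the labels. The synthetic data, the variable names (X_noise, y_noise, leaky_mean, honest_mean), and the choice of 2000 features with k=20 are assumptions made for illustration and are not part of the recipe in this post. Selecting features on the full data before cross-validation typically yields an estimate well above chance, while selecting inside a Pipeline keeps the estimate near chance level.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline

# pure noise: 100 rows, 2000 features, labels unrelated to the features
rng = np.random.RandomState(7)
X_noise = rng.normal(size=(100, 2000))
y_noise = rng.randint(0, 2, size=100)

kfold = KFold(n_splits=10, shuffle=True, random_state=7)

# leaky: the selector sees every row, including rows later used for testing
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky_mean = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y_noise, cv=kfold
).mean()

# leak-free: the selector is refit on the training part of each fold
pipe = Pipeline([
    ('select', SelectKBest(f_classif, k=20)),
    ('logistic', LogisticRegression(max_iter=1000)),
])
honest_mean = cross_val_score(pipe, X_noise, y_noise, cv=kfold).mean()

print("Leaky estimate:     %.3f" % leaky_mean)
print("Leak-free estimate: %.3f" % honest_mean)
```

Since the labels are random, any accuracy noticeably above 50% in the first estimate is an artifact of the leak rather than a sign of predictive power.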

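FeatureUnion: A utility in the sklearn.pipeline module that applies several feature selection or extraction steps to the same input and concatenates their outputs column-wise into a single feature matrix.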
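As a small sketch of what the concatenation looks like, the snippet below applies a FeatureUnion to synthetic data and checks the shape of the result. The data from make_classification and the names X_demo, y_demo, union, and combined are assumptions for illustration only; the selector and PCA settings mirror the ones used later in this recipe.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

# synthetic stand-in data: 50 rows, 8 feature columns, binary target
X_demo, y_demo = make_classification(n_samples=50, n_features=8, random_state=7)

union = FeatureUnion([
    ('select_best', SelectKBest(k=6)),  # keeps 6 of the original columns
    ('pca', PCA(n_components=3)),       # adds 3 principal components
])

# each transformer sees the same 8 input columns; outputs are stacked side by side
combined = union.fit_transform(X_demo, y_demo)
print(combined.shape)  # (50, 9): 6 selected columns + 3 components
```

Note that the union does not remove overlap between its parts: the principal components are computed from all 8 columns, including the 6 that the selector also passes through.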

In this example, we are working with:
Classification problem: Pima Indians Diabetes dataset
Feature Selection: Univariate Selection + Principal Component Analysis
Modeling: Logistic Regression (classification algorithm)

This recipe includes the following topics:

  • Load the classification dataset (Pima Indians diabetes) from GitHub
  • Split columns into the usual feature columns (X) and target column (Y)
  • Feature Selection #1: Univariate Selection using SelectKBest, keeping the k = 6 best features
  • Feature Selection #2: Principal Component Analysis with number of components = 3
  • Combine the outputs of both steps using FeatureUnion
  • Modeling: Logistic Regression (Classification Algorithm)
  • Create a pipeline of feature selection and modeling
  • Split data using KFold() class with K-fold count:10 and seed: 7
  • Evaluate the pipeline by calling cross_val_score() to run cross validation
  • Calculate mean from scores returned by cross_val_score()

# import modules
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import SelectKBest
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion

# read data file from github
# dataframe: pimaDf
gitFileURL = 'https://raw.githubusercontent.com/andrewgurung/data-repository/master/pima-indians-diabetes.data.csv'
cols = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
pimaDf = pd.read_csv(gitFileURL, names = cols)

# convert into numpy array for scikit-learn
pimaArr = pimaDf.values

# Let's split columns into the usual feature columns(X) and target column(Y)
# Y represents the target 'class' column whose value is either '0' or '1'
X = pimaArr[:, 0:8]
Y = pimaArr[:, 8]

# set k-fold count
folds = 10

# set seed to reproduce the same random data each time
seed = 7

# initialize Feature #1: SelectKBest class
uniSelector = SelectKBest(k=6)

# initialize Feature #2: PCA class
pca = PCA(n_components=3)

# combine both features using FeatureUnion
features = []
features.append(('select_best', uniSelector))
features.append(('pca', pca))
feature_union = FeatureUnion(features)

# initialize LogisticRegression class
lrModel = LogisticRegression()

# create pipeline with data preprocessing and model
estimators = []
estimators.append(('feature_union', feature_union))
estimators.append(('logistic', lrModel))
model = Pipeline(estimators)

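# split data using KFold
# random_state only takes effect when shuffle=True; recent scikit-learn
# releases raise a ValueError if a seed is passed without shuffling.
# Shuffled folds may give a slightly different mean accuracy than unshuffled ones.
kfold = KFold(n_splits=folds, shuffle=True, random_state=seed)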

# Evaluate the pipeline by calling cross_val_score() to run cross validation
resultArr = cross_val_score(model, X, Y, cv=kfold)

# calculate mean of scores for all folds
meanAccuracy = resultArr.mean()

# display accuracy
print("Mean accuracy: %.5f" % meanAccuracy)
Mean accuracy: 0.77604
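
Because the Pipeline behaves like a single estimator, it can also be fitted once and inspected afterwards. The sketch below rebuilds the same pipeline in compact form on synthetic stand-in data, so that it runs without a download; make_classification, max_iter=1000, and the names X_demo, y_demo, demo_model, and selected_mask are assumptions for illustration. With the real data, the X and Y arrays from above would be used instead.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline

# synthetic stand-in for the Pima data: 8 feature columns, binary target
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=7)

demo_model = Pipeline([
    ('feature_union', FeatureUnion([
        ('select_best', SelectKBest(k=6)),
        ('pca', PCA(n_components=3)),
    ])),
    ('logistic', LogisticRegression(max_iter=1000)),
])

# fit() runs feature selection and the classifier in sequence
demo_model.fit(X_demo, y_demo)

# predict() pushes new rows through the fitted feature steps first
predictions = demo_model.predict(X_demo[:5])

# look up the fitted selector by name to see which input columns it kept
fitted_union = demo_model.named_steps['feature_union']
fitted_selector = dict(fitted_union.transformer_list)['select_best']
selected_mask = fitted_selector.get_support()

print(predictions)
print(selected_mask)
```

The boolean mask has one entry per input column, with True marking the six columns that SelectKBest retained.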
