Machine learning models can have many hyperparameters, and finding the best combination of values is a search problem in its own right. Hyperparameter optimization, or tuning, is the problem of choosing a set of optimal hyperparameters for a learning algorithm.
Grid search is a tuning technique that performs an exhaustive search over a manually specified subset of the hyperparameter space of a learning algorithm.
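To make "exhaustive" concrete, the sketch below enumerates every combination in a small grid using only the standard library. The second hyperparameter (`fit_intercept`) and the helper name `expand_grid` are illustrative assumptions and not part of this recipe, which tunes `alpha` alone.

```python
from itertools import product

def expand_grid(param_grid):
    """Return every combination of values in param_grid as a list of dicts."""
    names = sorted(param_grid)
    value_lists = [param_grid[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]

# Hypothetical grid: 3 alpha values x 2 fit_intercept values = 6 candidates
candidates = expand_grid({"alpha": [1, 0.1, 0.01], "fit_intercept": [True, False]})
for params in candidates:
    print(params)
```

Grid search fits and scores the model once per candidate (and once per cross-validation fold), so the cost grows multiplicatively with each added hyperparameter.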
In this example, we use a Ridge regression model, where alpha is a hyperparameter that sets the regularization strength (a non-negative float). Regularization improves the conditioning of the problem and reduces the variance of the estimates.
Link: scikit-learn: Ridge documentation
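As a rough illustration of how alpha acts, consider the simplest case of one feature and no intercept, where the ridge coefficient has the closed form w = sum(x*y) / (sum(x*x) + alpha). The toy data and the helper `ridge_coef_1d` below are made up for this sketch; scikit-learn's Ridge handles the general multi-feature case.

```python
def ridge_coef_1d(x, y, alpha):
    """Ridge coefficient for a single feature with no intercept."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + alpha)

# Toy data, roughly y = 2x (invented for illustration)
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]

# Larger alpha means a stronger penalty on the coefficient
for alpha in [0, 0.1, 1, 10]:
    print(alpha, ridge_coef_1d(x, y, alpha))
```

The coefficient shrinks toward zero as alpha grows, and with alpha = 0 it reduces to the ordinary least squares estimate.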
This recipe includes the following topics:
- Load the classification problem dataset (Pima Indians) from GitHub
- Split columns into the usual feature columns (X) and target column (Y)
- Create a param_grid dictionary that maps parameter names to the values to try
- Instantiate the regression algorithm: Ridge
- Instantiate the GridSearchCV class with estimator and param_grid
- Find the mean cross-validated score of the best candidate
- Find the parameter value(s) that achieved the best score
# import modules
import pandas as pd
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
# read data file from GitHub
# dataframe: pimaDf
gitFileURL = 'https://raw.githubusercontent.com/andrewgurung/data-repository/master/pima-indians-diabetes.data.csv'
cols = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
pimaDf = pd.read_csv(gitFileURL, names = cols)
# convert into numpy array for scikit-learn
pimaArr = pimaDf.values
# Let's split columns into the usual feature columns (X) and target column (Y)
# Y represents the target 'class' column whose value is either 0 or 1
X = pimaArr[:, 0:8]
Y = pimaArr[:, 8]
# create a param_grid dictionary mapping each parameter name to the values to try
alphas = np.array([1, 0.1, 0.01, 0.001, 0.0001, 0])
param_grid = {'alpha': alphas}
# instantiate the regression algorithm: Ridge()
model = Ridge()
# set up a grid search over every candidate in param_grid, scored by cross-validation
grid = GridSearchCV(estimator=model, param_grid=param_grid)
# call fit() to train the grid search using X and Y data
grid.fit(X, Y)
# Find the mean cross-validated score of the best candidate (R^2 by default for Ridge)
bestScore = grid.best_score_
# Find the parameter value that achieved the best score
bestAlpha = grid.best_estimator_.alpha
print("Best Score: %.5f, Best Alpha(Hyperparameter): %f" % (bestScore, bestAlpha))
Best Score: 0.27962, Best Alpha(Hyperparameter): 1.000000
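Beyond the single best score, it can help to see how each candidate performed. The sketch below is self-contained and uses synthetic data from `make_regression` rather than the Pima dataset (an assumption made so it runs without a download), so its numbers will not match the output above. The names `X_demo`, `y_demo`, and `demo_grid` are hypothetical. It reads the per-candidate mean scores from `cv_results_`.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the Pima dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# Three candidate alphas, each scored with 5-fold cross-validation
demo_grid = GridSearchCV(estimator=Ridge(), param_grid={"alpha": [1, 0.1, 0.01]}, cv=5)
demo_grid.fit(X_demo, y_demo)

# One entry per candidate: the parameters tried and the mean R^2 across folds
results = demo_grid.cv_results_
for params, mean_score in zip(results["params"], results["mean_test_score"]):
    print(params, round(mean_score, 5))

print("Best:", demo_grid.best_params_)
```

Comparing the full list of mean scores shows whether the best alpha clearly stands out or whether several candidates perform about the same.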