Regression Model evaluation: R2

There are 3 different APIs for model evaluation:
1. Estimator score method: Every estimator/model object has a score() method that provides a default evaluation (for regressors, the default is R2)
2. Scoring parameter: A predefined scorer name that can be passed to cross-validation tools such as the cross_val_score() function
3. Metric function: Functions defined in the sklearn.metrics module
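
To make the difference concrete, here is a minimal sketch that gets R2 through each of the three APIs. The synthetic data and the names X_demo, y_demo and model_demo are assumptions made up for illustration; they are not part of the recipe below.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score

# synthetic data (an assumption for illustration, not the housing data)
rng = np.random.RandomState(0)
X_demo = rng.rand(100, 3)
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.randn(100)

model_demo = LinearRegression().fit(X_demo, y_demo)

# 1. estimator score method: for regressors the default score is R2
score_from_estimator = model_demo.score(X_demo, y_demo)

# 2. scoring parameter: cross_val_score refits a fresh model on each fold
scores_from_cv = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5, scoring='r2')

# 3. metric function: compare true targets with predictions directly
score_from_metric = r2_score(y_demo, model_demo.predict(X_demo))

print(score_from_estimator, score_from_metric, scores_from_cv.mean())
```

The first and third values are computed on the training data, so they agree with each other. The cross-validated scores are computed on held-out folds and will usually be somewhat lower.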

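In this recipe, R2 is requested through the Scoring parameter API (scoring='r2').
R2 (the coefficient of determination) measures the proportion of the variance in the target that the fitted regression model explains.
0: No better than always predicting the mean of the target
1: Perfect fit
Negative: Worse than always predicting the mean (possible on held-out data)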
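
As a quick check of these reference values, R2 can be computed by hand as 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations from the mean of the target. The sketch below uses plain Python and a made-up four-point target (an assumption for illustration). It also shows that R2 drops below 0 when predictions are worse than simply predicting the mean.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# made-up target values, for illustration only
y_true = [1.0, 2.0, 3.0, 4.0]

perfect = r2(y_true, [1.0, 2.0, 3.0, 4.0])    # predictions match exactly
mean_only = r2(y_true, [2.5, 2.5, 2.5, 2.5])  # always predict the mean
worse = r2(y_true, [4.0, 3.0, 2.0, 1.0])      # predictions in reverse order

print(perfect, mean_only, worse)  # 1.0 0.0 -3.0
```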

Note:
– For the regression problem, we will use the Boston house price dataset.
– Estimator/Algorithm: Linear Regression
– Cross-Validation Split: K-Fold (k=10)


This recipe includes the following topics:

  • Load data/file from github
  • Split columns into the usual feature columns(X) and target column(Y)
  • Set k-fold count to 10
  • Set seed to reproduce the same random data each time
  • Split data using KFold() class
  • Instantiate a regression model (LinearRegression)
  • Set scoring parameter to ‘r2’
  • Call cross_val_score() to run cross validation
  • Calculate mean and standard deviation from scores returned by cross_val_score()
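
For intuition about the k-fold split step, here is a rough pure-Python sketch of how an unshuffled k-fold split partitions row indices into contiguous test blocks. The helper kfold_indices is hypothetical and written only for illustration; the recipe itself uses scikit-learn's KFold class. The sample count of 506 is assumed to match the number of rows in the Boston housing data.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train, test) index lists for an unshuffled k-fold split."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # the first `extra` folds get one additional sample
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

# assumed row count of the housing data, split into 10 folds
splits = list(kfold_indices(506, 10))
test_sizes = [len(test) for _, test in splits]
print(test_sizes)  # [51, 51, 51, 51, 51, 51, 50, 50, 50, 50]
```

Because the test blocks are taken in row order, any ordering in the data file carries over into the folds, which is one reason per-fold scores can vary a lot.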


# import modules
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

# read data file from github
# dataframe: houseDf
gitFileURL = 'https://raw.githubusercontent.com/andrewgurung/data-repository/master/housing.csv'
cols = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
houseDf = pd.read_csv(gitFileURL, sep=r'\s+', names=cols)

# convert into numpy array for scikit-learn
houseArr = houseDf.values

# Let's split columns into the usual feature columns(X) and target column(Y)
# Y represents the target 'MEDV' column
X = houseArr[:, 0:13]
Y = houseArr[:, 13]

# set k-fold count
folds = 10

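# set seed for reproducibility
# note: the seed only takes effect if the folds are shuffled (shuffle=True)
seed = 7

# split data using KFold
# without shuffling, folds are taken in row order and random_state has no effect;
# recent scikit-learn versions raise a ValueError if random_state is set while shuffle=False
kfold = KFold(n_splits=folds)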

# instantiate a regression model
model = LinearRegression()

# set scoring parameter to 'r2'
scoring = 'r2'

# call cross_val_score() to run cross validation
resultArr = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)

# calculate mean of scores for all folds
r2 = resultArr.mean()

# calculate standard deviation
stdAccuracy = resultArr.std()

# display r2 score
# higher is better: 1 is a perfect fit, and a fold can score below 0
# a standard deviation this large relative to the mean shows that R2 varies widely across folds
print("R2: %.3f, Standard Deviation : %.3f" % (r2, stdAccuracy))
R2: 0.203, Standard Deviation : 0.595
