Preprocessing data: Rescaling using scikit-learn

Rescaling transforms every feature (column) of the data to a common scale.
Each transformed value lies between a chosen minimum and maximum, most often zero and one.
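
The arithmetic behind min-max rescaling is short: each column value x becomes (x - min) / (max - min), and the result is then stretched to the target range. Here is a minimal NumPy sketch of that idea. The helper name min_max_rescale and the toy array are made up for illustration; they are not part of scikit-learn or of this recipe's dataset.

```python
import numpy as np

def min_max_rescale(data, feature_range=(0.0, 1.0)):
    """Illustrative helper: rescale each column of `data` linearly into `feature_range`."""
    data = np.asarray(data, dtype=float)
    lo, hi = feature_range
    col_min = data.min(axis=0)
    col_max = data.max(axis=0)
    # Guard against constant columns, where max == min would divide by zero
    span = np.where(col_max > col_min, col_max - col_min, 1.0)
    return (data - col_min) / span * (hi - lo) + lo

# Made-up toy data: two columns on very different scales
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [4.0, 40.0]])

rescaled_toy = min_max_rescale(toy)
print(rescaled_toy)
```

Both columns end up spanning exactly 0 to 1, even though the raw values differ by a factor of ten.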

This recipe includes the following topics:

  • Rescale using MinMaxScaler class
  • Call fit() to compute the min and max value to be used for later scaling
  • Call transform() on the input data


# import modules
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# read data file from github
# dataframe: pimaDf
gitFileURL = 'https://raw.githubusercontent.com/andrewgurung/data-repository/master/pima-indians-diabetes.data.csv'
cols = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
pimaDf = pd.read_csv(gitFileURL, names = cols)

# convert into numpy array
pimaArr = pimaDf.values

# Split the array into input features (X) and the target class (Y)
# Only X is rescaled in this example; Y is not used
X = pimaArr[:, 0:8]
Y = pimaArr[:, 8]

# 1. initialize MinMaxScaler class to limit output range between 0 and 1
# 2. call fit() function to compute the min and max value
scaler = MinMaxScaler(feature_range=(0,1)).fit(X)

# rescale input data using transform()
rescaledX = scaler.transform(X)

# limit precision to 3 decimal places for printing
np.set_printoptions(precision=3)

# print first 3 rows of input data
print(X[:3,])
print('-'*60)

# print first 3 rows of output data
print(rescaledX[:3,])

Output:

[[  6.    148.     72.     35.      0.     33.6     0.627  50.   ]
 [  1.     85.     66.     29.      0.     26.6     0.351  31.   ]
 [  8.    183.     64.      0.      0.     23.3     0.672  32.   ]]
------------------------------------------------------------
[[0.353 0.744 0.59  0.354 0.    0.501 0.234 0.483]
 [0.059 0.427 0.541 0.293 0.    0.396 0.117 0.167]
 [0.471 0.92  0.525 0.    0.    0.347 0.254 0.183]]
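
Because fit() stores the per-column minimum and maximum, the same fitted scaler can rescale rows it never saw during fitting. The sketch below illustrates this on made-up toy data. The names train, new_rows and toy_scaler are hypothetical and are not part of the recipe above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up toy data used only for illustration
train = np.array([[1.0, 100.0],
                  [3.0, 300.0],
                  [5.0, 500.0]])
new_rows = np.array([[2.0, 200.0],
                     [6.0, 600.0]])  # second row lies outside the fitted range

# fit() learns the per-column min and max from `train` only
toy_scaler = MinMaxScaler(feature_range=(0, 1)).fit(train)
print(toy_scaler.data_min_)  # per-column minimum seen during fit
print(toy_scaler.data_max_)  # per-column maximum seen during fit

# transform() reuses those stored values on any array with the same columns
scaled_new = toy_scaler.transform(new_rows)
print(scaled_new)
```

With the default settings, values outside the fitted range are not clipped, so the second row maps to 1.25 in both columns rather than being capped at 1.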
