Preprocessing data: Normalization using scikit-learn

Normalization rescales each row (sample) of the data matrix that has at least one nonzero component, independently of the other samples, so that its norm (l1 or l2) equals one.
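As a rough illustration of this definition, the sketch below computes the l2 normalization by hand with NumPy. The toy matrix `rows` and the helper name `l2_normalize_rows` are made up for this example and are not part of scikit-learn. An all-zero row is left unchanged.

```python
import numpy as np

def l2_normalize_rows(rows):
    """Scale each nonzero row to unit l2 norm (hypothetical helper)."""
    rows = np.asarray(rows, dtype=float)
    # one l2 norm per row, kept as a column so it broadcasts over the row
    norms = np.linalg.norm(rows, axis=1, keepdims=True)
    # avoid dividing by zero: all-zero rows get a divisor of 1
    norms[norms == 0] = 1.0
    return rows / norms

rows = np.array([[3.0, 4.0], [0.0, 0.0], [1.0, 1.0]])
unit_rows = l2_normalize_rows(rows)
print(unit_rows)
```

The first row, [3, 4], has l2 norm 5, so it becomes [0.6, 0.8].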

This recipe includes the following topics:

  • Normalize using the Normalizer class (l2 norm by default)
  • Call fit(), which does nothing because Normalizer is stateless
  • Call transform() on the input data, which scales each nonzero row of X to unit norm
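The steps listed above can be sketched on a small made-up matrix (`toyX` is an assumption for illustration, not the recipe's data). Normalizer uses the l2 norm by default and accepts norm='l1' as an alternative. Because it is stateless, fit_transform() should give the same result as calling fit() and then transform().

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# toy data for illustration only; the second row is all zeros
toyX = np.array([[3.0, 4.0], [0.0, 0.0], [1.0, 1.0]])

# fit() returns the scaler itself, so the calls can be chained
l2Result = Normalizer().fit(toyX).transform(toyX)

# same result in one call
sameResult = Normalizer().fit_transform(toyX)

# l1 norm: the absolute values in each nonzero row sum to one
l1Result = Normalizer(norm='l1').fit_transform(toyX)

print(l2Result)
print(l1Result)
```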


# import modules
import pandas as pd
import numpy as np
from sklearn.preprocessing import Normalizer

# read data file from github
# dataframe: pimaDf
gitFileURL = 'https://raw.githubusercontent.com/andrewgurung/data-repository/master/pima-indians-diabetes.data.csv'
cols = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
pimaDf = pd.read_csv(gitFileURL, names = cols)

# convert into numpy array
pimaArr = pimaDf.values

# split the columns into input features (X) and the target class (Y)
# Y is not used in this example because Normalizer only transforms X
X = pimaArr[:, 0:8]
Y = pimaArr[:, 8]

# 1. instantiate the Normalizer class
# 2. call fit(), which does nothing in the case of Normalizer
scaler = Normalizer().fit(X)

# normalize input data using transform()
normalizedX = scaler.transform(X)

# limit precision to 3 decimal places for printing
np.set_printoptions(precision=3)

# print first 3 rows of input data
print(X[:3])
print('-' * 60)

# print first 3 rows of normalized data
print(normalizedX[:3])

# output:
[[  6.    148.     72.     35.      0.     33.6     0.627  50.   ]
 [  1.     85.     66.     29.      0.     26.6     0.351  31.   ]
 [  8.    183.     64.      0.      0.     23.3     0.672  32.   ]]
------------------------------------------------------------
[[0.034 0.828 0.403 0.196 0.    0.188 0.004 0.28 ]
 [0.008 0.716 0.556 0.244 0.    0.224 0.003 0.261]
 [0.04  0.924 0.323 0.    0.    0.118 0.003 0.162]]
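One way to sanity-check the output above is to confirm that every normalized row has an l2 norm of 1. The sketch below hard-codes the three input rows printed above, so it does not need the CSV download. The names `firstRows`, `checkedRows`, and `rowNorms` are introduced here for illustration and are not part of the original recipe.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# first three input rows, copied from the printed output above
firstRows = np.array([
    [6.0, 148.0, 72.0, 35.0, 0.0, 33.6, 0.627, 50.0],
    [1.0, 85.0, 66.0, 29.0, 0.0, 26.6, 0.351, 31.0],
    [8.0, 183.0, 64.0, 0.0, 0.0, 23.3, 0.672, 32.0],
])

checkedRows = Normalizer().fit_transform(firstRows)

# l2 norm of each normalized row; each value should be 1 up to rounding
rowNorms = np.linalg.norm(checkedRows, axis=1)
print(np.allclose(rowNorms, 1.0))
```

Note that the columns are not rescaled to a common range: a feature with large raw values, such as plas, still dominates each normalized row.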
