Preprocessing data: Binarization using scikit-learn

Values greater than the threshold map to 1, while values less than or equal to the threshold map to 0. This is known as binarizing or thresholding.
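
To make the rule concrete, here is a minimal sketch on a tiny hand-made array. The values in `toy` and the threshold of 1.0 are assumptions chosen only for illustration; they are not part of the Pima dataset used in the recipe. The sketch also checks the result against a plain NumPy comparison.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Toy data (hypothetical values, chosen only to illustrate the rule)
toy = np.array([[-1.0, 0.0, 0.5],
                [ 2.0, 1.0, 1.5]])

# threshold=1.0 is an arbitrary choice for this sketch
toy_binary = Binarizer(threshold=1.0).fit_transform(toy)
print(toy_binary)

# The same rule written as a plain NumPy comparison:
# strictly greater than the threshold -> 1, otherwise -> 0
numpy_binary = (toy > 1.0).astype(float)
print(np.array_equal(toy_binary, numpy_binary))
```

Note that the value exactly equal to the threshold (1.0 in the second row) maps to 0, since only values strictly greater than the threshold become 1.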

This recipe includes the following topics:

  • Binarize using Binarizer class with threshold value of 0.0
  • Call fit() (Does nothing in this case)
  • Call transform() on the input data
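
As a small aside on why fit() does nothing here: Binarizer learns nothing from the data, so fitting simply returns the same object. The sketch below illustrates this on a hypothetical two-row array (`sample` and the variable names are assumptions made for this example).

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical sample values, used only for illustration
sample = np.array([[0.0, 3.0],
                   [5.0, 0.0]])

stateless = Binarizer(threshold=0.0)

# fit() learns nothing from the data and returns the estimator itself
fitted = stateless.fit(sample)
print(fitted is stateless)

# fit_transform() and fit() followed by transform() give the same result
same_result = np.array_equal(stateless.fit_transform(sample),
                             fitted.transform(sample))
print(same_result)
```

The fit() call is kept in the recipe mainly for consistency with other scikit-learn transformers, which do need it.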


# import modules
import pandas as pd
import numpy as np
from sklearn.preprocessing import Binarizer

# read data file from GitHub
# dataframe: pimaDf
gitFileURL = 'https://raw.githubusercontent.com/andrewgurung/data-repository/master/pima-indians-diabetes.data.csv'
cols = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
pimaDf = pd.read_csv(gitFileURL, names = cols)

# convert into numpy array
pimaArr = pimaDf.values

# split the array into input features (X) and the target column (Y)
# Y is not used in this example
X = pimaArr[:, 0:8]
Y = pimaArr[:, 8]

# 1. instantiate the Binarizer class with a threshold value of 0.0
# 2. call fit(): does nothing in this case, since Binarizer learns nothing from the data
binarizer = Binarizer(threshold=0.0).fit(X)

# binarize input data using transform()
binaryX = binarizer.transform(X)

# limit precision to 3 decimal places for printing
np.set_printoptions(precision=3)

# print first 3 rows of input data
print(X[:3,])
print('-'*60)

# print first 3 rows of output data
print(binaryX[:3,])
[[  6.    148.     72.     35.      0.     33.6     0.627  50.   ]
 [  1.     85.     66.     29.      0.     26.6     0.351  31.   ]
 [  8.    183.     64.      0.      0.     23.3     0.672  32.   ]]
------------------------------------------------------------
[[1. 1. 1. 1. 0. 1. 1. 1.]
 [1. 1. 1. 1. 0. 1. 1. 1.]
 [1. 1. 1. 0. 0. 1. 1. 1.]]
