Calculate correlation of columns

Correlation measures the strength and direction of the linear relationship between two attributes.
– The correlation coefficient ranges from −1 (full negative correlation) to 1 (full positive correlation)
– A value of 0 implies that there is no linear correlation between the columns
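
To make the range concrete, here is a minimal sketch that computes Pearson's coefficient by hand for three small made-up sequences. The `pearson` helper and the sample values are illustrative assumptions, not part of the recipe; pandas does the equivalent work internally when `corr(method='pearson')` is called.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # numerator: sum of products of deviations from each mean
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # denominator: product of the root summed squared deviations
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)

# made-up sample data, for illustration only
x = [1, 2, 3, 4, 5]
r_pos = pearson(x, [2, 4, 6, 8, 10])   # y rises in step with x
r_neg = pearson(x, [10, 8, 6, 4, 2])   # y falls as x rises
r_zero = pearson(x, [2, 1, 0, 1, 2])   # symmetric V shape, no linear trend
print(r_pos, r_neg, r_zero)
```

The three results should come out at approximately 1, −1 and 0, up to floating-point rounding. The last case shows that a coefficient of 0 rules out only a linear relationship: the V-shaped data is clearly related to x, just not linearly.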

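Note: Some machine learning algorithms, such as linear and logistic regression, can perform poorly when input columns are highly correlated, so it is often recommended to remove one column from each highly correlated pair.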
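
One possible way to act on this note is sketched below. The helper name `highly_correlated_columns`, the 0.9 threshold, and the small demo DataFrame are all assumptions chosen for illustration; a suitable threshold depends on the data and the algorithm.

```python
import numpy as np
import pandas as pd

def highly_correlated_columns(df, threshold=0.9):
    """Columns whose absolute Pearson correlation with an earlier column exceeds threshold."""
    corr = df.corr(method='pearson').abs()
    # keep only the upper triangle (k=1 skips the diagonal) so each pair is checked once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]

# made-up data: 'b' is an exact linear function of 'a', 'c' is only weakly related
demo = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 10],
    'c': [5, 1, 4, 2, 3],
})

to_drop = highly_correlated_columns(demo, threshold=0.9)
reduced = demo.drop(columns=to_drop)
print(to_drop)
print(reduced.columns.tolist())
```

In this sketch the later column of each correlated pair is the one dropped, which is an arbitrary choice; in practice it may be better to keep whichever column is more meaningful for the problem.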


This recipe includes the following topics:

  • Use the standard Pearson’s Correlation Coefficient
  • Compute pairwise correlation of columns


# import module
import pandas as pd

fileGitURL = 'https://raw.githubusercontent.com/andrewgurung/data-repository/master/pima-indians-diabetes.data.csv'

# define column names
cols = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

# load file as a Pandas DataFrame
pimaDf = pd.read_csv(fileGitURL, names=cols)

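# set display options: show 3 decimal places and widen the printed output
# (the full option name is used because the short form 'precision' is ambiguous in recent pandas versions)
pd.set_option('display.precision', 3)
pd.set_option('display.width', 100)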

# calculate correlation between columns
correlation = pimaDf.corr(method='pearson')
print(correlation)
        preg   plas   pres   skin   test   mass   pedi    age  class
preg   1.000  0.129  0.141 -0.082 -0.074  0.018 -0.034  0.544  0.222
plas   0.129  1.000  0.153  0.057  0.331  0.221  0.137  0.264  0.467
pres   0.141  0.153  1.000  0.207  0.089  0.282  0.041  0.240  0.065
skin  -0.082  0.057  0.207  1.000  0.437  0.393  0.184 -0.114  0.075
test  -0.074  0.331  0.089  0.437  1.000  0.198  0.185 -0.042  0.131
mass   0.018  0.221  0.282  0.393  0.198  1.000  0.141  0.036  0.293
pedi  -0.034  0.137  0.041  0.184  0.185  0.141  1.000  0.034  0.174
age    0.544  0.264  0.240 -0.114 -0.042  0.036  0.034  1.000  0.238
class  0.222  0.467  0.065  0.075  0.131  0.293  0.174  0.238  1.000
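
A common next step is to read off which attributes correlate most strongly with the target. The sketch below retypes the `class` column of the matrix printed above so that it runs without downloading the file; the variable names `class_corr` and `ranked` are illustrative.

```python
import pandas as pd

# 'class' column of the correlation matrix printed above,
# with the self-correlation entry (class vs class = 1.000) left out
class_corr = pd.Series({
    'preg': 0.222, 'plas': 0.467, 'pres': 0.065, 'skin': 0.075,
    'test': 0.131, 'mass': 0.293, 'pedi': 0.174, 'age': 0.238,
})

# rank attributes by the strength of their linear correlation with 'class'
ranked = class_corr.abs().sort_values(ascending=False)
print(ranked)
```

With the DataFrame from the recipe already loaded, the same Series can be taken directly from the result with `correlation['class'].drop('class')`. Here `plas` has the strongest linear correlation with `class` (0.467) and `pres` the weakest (0.065).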
