Feature selection: Feature Importance

Ensembles of decision trees, such as Random Forest, can be used to estimate the importance of features. The higher the score, the more important the feature.

This recipe includes the following topics:

  • Initialize RandomForestClassifier class
  • Call fit() to build a forest of trees from the training set (X, y)
  • Display the feature importances


# 4. Feature Importance
# import modules
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# read data file from github
# dataframe: pimaDf
gitFileURL = 'https://raw.githubusercontent.com/andrewgurung/data-repository/master/pima-indians-diabetes.data.csv'
cols = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
pimaDf = pd.read_csv(gitFileURL, names = cols)

# convert into numpy array
pimaArr = pimaDf.values

# Let's split our data into input features (X) and the target variable (Y)
X = pimaArr[:, 0:8]
Y = pimaArr[:, 8]

# initialize RandomForestClassifier class
# call fit() to build a forest of trees from the training set (X, Y)
rfc = RandomForestClassifier().fit(X, Y)

# display the feature importances
print("Feature importances: %s" % rfc.feature_importances_)
print('-'*60)

# The scores suggest that plas, mass, and pedi are the most important features
Feature importances: [0.10203687 0.25106337 0.08872303 0.06846597 0.07482446 0.15623041
 0.13915677 0.11949911]
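
Since the raw array is ordered by column position, it is easier to read the ranking when each score is paired with its feature name. Below is a minimal sketch that reuses the cols list and the fitted rfc model from the recipe above. Note that Random Forest is stochastic, so the exact scores will vary between runs unless a random_state is passed when initializing the classifier.

# pair each importance score with its feature name (cols minus 'class')
importances = pd.Series(rfc.feature_importances_, index=cols[:-1])
# rank the features from most to least important
print(importances.sort_values(ascending=False))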
