Finding Optimal n Using RFECV

tags: #python/data_science/model_selection

What is RFECV and how is it different from RFE?

RFECV (Recursive Feature Elimination with Cross-Validation) is a method that combines the feature selection of RFE with cross-validation to find the optimal number of features.

The main difference is that RFE requires you to specify the number of features to keep up front, while RFECV uses cross-validated scores to determine the optimal number of features automatically.
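The contrast can be sketched on a synthetic dataset (the data and parameter values below are illustrative, not from the original note):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, RFECV

# synthetic data, purely for illustration
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=4, random_state=42)
est = RandomForestClassifier(n_estimators=50, random_state=42)

# RFE: the number of features to keep must be chosen up front
rfe = RFE(estimator=est, n_features_to_select=4).fit(X, y)

# RFECV: cross-validation picks the number of features for us
rfecv = RFECV(estimator=est, cv=3).fit(X, y)

print(rfe.n_features_)    # always 4, because we asked for 4
print(rfecv.n_features_)  # determined by cross-validated scores
```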

Performing RFECV in Python

To find the optimal features for a base model (e.g., a Random Forest Classifier) using Recursive Feature Elimination (RFE), we can use the RFECV class from sklearn.feature_selection:

  1. Import libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
  1. Instantiate the Random Forest classifier
rfc = RandomForestClassifier(n_estimators=100, random_state=42)
  1. Instantiate the RFECV estimator
# step - the number of features removed at each iteration
# for a reproducible shuffle: StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
rfecv = RFECV(estimator=rfc, step=1, cv=StratifiedKFold(5), scoring='accuracy')
  1. Fit the RFECV estimator to the training data
rfecv.fit(X_train, y_train)
  1. View results
# view optimal number of features
print(rfecv.n_features_)

# boolean array indicating which features are selected
print(rfecv.support_)

# printing features
print(X_train.columns[rfecv.support_])
  1. Subset the original dataset to include only the optimal features
X_train_optimal_features = rfecv.transform(X_train)

Alternatively, we can subset X_train to include only the selected features using rfecv.support_, a boolean mask indicating which features RFECV selected. Unlike rfecv.transform(), which returns a NumPy array, boolean-mask indexing keeps the DataFrame and its column names:

# Assuming rfecv is the fitted RFECV object and X is the original feature matrix
X_train_selected = X_train.loc[:, rfecv.support_] 
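The difference between the two subsetting approaches can be verified end to end; the dataset and column names below are hypothetical, for illustration only:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# hypothetical training data with named columns
X_arr, y = make_classification(n_samples=150, n_features=6,
                               n_informative=3, random_state=0)
X_train = pd.DataFrame(X_arr, columns=[f'f{i}' for i in range(6)])

rfecv = RFECV(RandomForestClassifier(n_estimators=50, random_state=0),
              cv=3).fit(X_train, y)

# transform() returns a plain NumPy array; the column names are lost
as_array = rfecv.transform(X_train)

# boolean-mask indexing keeps the DataFrame and its column names
as_frame = X_train.loc[:, rfecv.support_]
print(list(as_frame.columns))
```

Both give the same values and shape; the mask-based version is preferable when downstream code relies on column names.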

Identifying Excluded Features

# identify the features that were excluded
excluded_features = X_train.columns[~rfecv.support_]
print(excluded_features)