Finding the Optimal Number of Features Using RFECV
tags: #python/data_science/model_selection
What is RFECV and how is it different from RFE?
RFECV (Recursive Feature Elimination with Cross-Validation) is a method that combines the feature selection of RFE with cross-validation to find the optimal number of features.
The main difference is:
- RFE - the number of features to keep is set by the user
- RFECV - uses cross-validation to find the optimal number of features recursively, without the user having to define it explicitly.
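The contrast can be sketched side by side. This is a minimal example on synthetic data (the `make_classification` dataset and parameter values are illustrative assumptions, not from the note):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, RFECV

# Illustrative synthetic dataset: 10 features, only 4 informative
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=4, random_state=42)
rfc = RandomForestClassifier(n_estimators=50, random_state=42)

# RFE: the user fixes how many features to keep
rfe = RFE(estimator=rfc, n_features_to_select=4).fit(X, y)

# RFECV: cross-validation chooses the number of features
rfecv = RFECV(estimator=rfc, cv=3).fit(X, y)

print(rfe.n_features_)    # always 4 - fixed by the user
print(rfecv.n_features_)  # determined by cross-validation
```

With RFE the result is whatever the user asked for; with RFECV the count depends on which feature subset scored best across the CV folds.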
Performing RFECV in Python
To find the optimal features for a base model (e.g., a Random Forest Classifier), we can use the RFECV class from sklearn.feature_selection:
- Import libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
- Instantiate the Random Forest Classifier estimator
rfc = RandomForestClassifier(n_estimators=100, random_state=42)
- Instantiate the RFECV estimator
# StratifiedKFold(5) is shorthand for StratifiedKFold(n_splits=5);
# pass shuffle=True and random_state for reproducible shuffled folds
rfecv = RFECV(estimator=rfc, step=1, cv=StratifiedKFold(5), scoring='accuracy')
# step - number of features removed at each iteration
- In this step, the cross-validation parameter (cv) is set to StratifiedKFold(5) for 5-fold cross-validation using a stratified approach.
- scoring is explicitly set to 'accuracy' to evaluate the performance of the model at each feature count.
- Fit the RFECV estimator to the training data
rfecv.fit(X_train, y_train)
- View results
# view optimal number of features
print(rfecv.n_features_)
# boolean array indicating which features are selected
print(rfecv.support_)
# print the names of the selected features
print(X_train.columns[rfecv.support_])
- Subset the original dataset to include only the optimal features
X_train_optimal_features = rfecv.transform(X_train)
Alternatively, we can subset X_train to include only the selected features using rfecv.support_, which returns a boolean mask indicating the features selected by RFECV:
# Assuming rfecv is the fitted RFECV object and X_train is the original feature matrix
X_train_selected = X_train.loc[:, rfecv.support_]
- Here we are selecting all rows of X_train, and only the columns where the boolean mask is True.
Identifying Excluded Features
# identify features that were excluded
excluded_features = X_train.columns[~rfecv.support_]
print(excluded_features)
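Putting the steps above together, here is a small end-to-end sketch using a pandas DataFrame so the column names survive the masking (the synthetic data and `feat_i` column names are illustrative assumptions):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# Illustrative dataset wrapped in a DataFrame with named columns
X_arr, y = make_classification(n_samples=200, n_features=6,
                               n_informative=3, random_state=42)
X_train = pd.DataFrame(X_arr, columns=[f'feat_{i}' for i in range(6)])

rfecv = RFECV(RandomForestClassifier(n_estimators=50, random_state=42),
              cv=3).fit(X_train, y)

# Boolean mask splits the columns into selected and excluded sets
selected = X_train.columns[rfecv.support_]
excluded = X_train.columns[~rfecv.support_]
print(list(selected))
print(list(excluded))
```

Note that `rfecv.transform(X_train)` returns a plain NumPy array, whereas the `support_` mask with `.loc` keeps the DataFrame and its column names.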