Finding the Optimal Number of Features Using RFECV
tags: #python/data_science/model_selection
What is RFECV and how is it different from RFE?
RFECV (Recursive Feature Elimination with Cross-Validation) is a method that combines the feature selection of RFE with cross-validation to find the optimal number of features.
The main difference is:
- RFE - the number of features to keep is set by the user
- RFECV - uses cross-validation to find the optimal number of features recursively, without the user having to define it explicitly.
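The contrast can be sketched side by side. This is a minimal example on synthetic data (the `make_classification` dataset and parameter values are illustrative assumptions, not from the note):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, RFECV

# Illustrative synthetic dataset: 10 features, only 4 informative
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=4, random_state=42)
rfc = RandomForestClassifier(n_estimators=50, random_state=42)

# RFE: the user fixes how many features to keep
rfe = RFE(estimator=rfc, n_features_to_select=4).fit(X, y)

# RFECV: cross-validation chooses the number of features
rfecv = RFECV(estimator=rfc, cv=3).fit(X, y)

print(rfe.n_features_)    # always 4 - fixed by the user
print(rfecv.n_features_)  # determined by cross-validation
```

With RFE the result is whatever the user asked for; with RFECV the count depends on which feature subset scored best across the CV folds.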
Performing RFECV in Python
To find the optimal features for a base model (e.g., a Random Forest Classifier), we can use the RFECV class from sklearn.feature_selection:
- Import libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
- Instantiate the Random Forest Classifier estimator
rfc = RandomForestClassifier(n_estimators=100, random_state=42)
- Instantiate the RFECV estimator
# StratifiedKFold(5) is shorthand for StratifiedKFold(n_splits=5);
# pass shuffle=True and random_state for reproducible shuffled folds
rfecv = RFECV(estimator=rfc, step=1, cv=StratifiedKFold(5), scoring='accuracy')
# step - number of features removed at each iteration
- In this step, the cross-validation parameter (cv) is set to StratifiedKFold(5) for 5-fold cross-validation using a stratified approach.
- scoring is explicitly set to 'accuracy' to evaluate the performance of the model at each feature count.
- Fit the RFECV estimator to the training data
rfecv.fit(X_train, y_train)
- View results
# view optimal number of features
print(rfecv.n_features_)
# boolean array indicating which features are selected
print(rfecv.support_)
# print the names of the selected features
print(X_train.columns[rfecv.support_])
- Subset the original dataset to include only the optimal features
X_train_optimal_features = rfecv.transform(X_train)
Alternatively, we can subset X_train to include only the selected features using rfecv.support_, which returns a boolean mask indicating the features selected by RFECV:
# Assuming rfecv is the fitted RFECV object and X_train is the original feature matrix
X_train_selected = X_train.loc[:, rfecv.support_]
- Here we are selecting all rows of X_train, and only the columns where the boolean mask is True.
Identifying Excluded Features
# identify features that were excluded
excluded_features = X_train.columns[~rfecv.support_]
print(excluded_features)
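Putting the steps above together, here is a small end-to-end sketch using a pandas DataFrame so the column names survive the masking (the synthetic data and `feat_i` column names are illustrative assumptions):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# Illustrative dataset wrapped in a DataFrame with named columns
X_arr, y = make_classification(n_samples=200, n_features=6,
                               n_informative=3, random_state=42)
X_train = pd.DataFrame(X_arr, columns=[f'feat_{i}' for i in range(6)])

rfecv = RFECV(RandomForestClassifier(n_estimators=50, random_state=42),
              cv=3).fit(X_train, y)

# Boolean mask splits the columns into selected and excluded sets
selected = X_train.columns[rfecv.support_]
excluded = X_train.columns[~rfecv.support_]
print(list(selected))
print(list(excluded))
```

Note that `rfecv.transform(X_train)` returns a plain NumPy array, whereas the `support_` mask with `.loc` keeps the DataFrame and its column names.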