Wrapper-Methods
tags: #python/data_science/preprocessing/feature_selection
Wrapper-based methods select features by training a machine learning model with different subsets of features and evaluating the performance of the model on a validation set.
This is an iterative search. At each step:
- a candidate feature is added to (forward search) or removed from (backward search) the current subset, and
- the change is kept only if it improves the model's validation score.
The process repeats until no change improves the model, or until the desired number of features is reached.
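The loop above can be sketched from scratch as a greedy forward search. This is a minimal, illustrative sketch (the diabetes dataset, linear regression, and R2 scoring are assumptions, not part of the original note):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)
model = LinearRegression()

selected = []                        # current feature subset (column indices)
remaining = list(range(X.shape[1]))  # candidates not yet in the subset
best_score = -float("inf")

while remaining:
    # score each candidate feature when added to the current subset
    scores = {
        f: cross_val_score(model, X[:, selected + [f]], y, cv=5, scoring="r2").mean()
        for f in remaining
    }
    f_best = max(scores, key=scores.get)
    if scores[f_best] <= best_score:  # no candidate improves the model: stop
        break
    best_score = scores[f_best]
    selected.append(f_best)
    remaining.remove(f_best)

print("selected feature indices:", selected)
print("cross-validated R2:", round(best_score, 3))
```

Swapping the add step for a remove step (start from all features, drop the one whose removal hurts least) gives the backward variant of the same loop.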
Techniques
Note: The number of features to be selected is a hyperparameter that needs to be chosen beforehand.
Forward Selection
Forward selection is an iterative method in which we start with no features in the model. In each iteration, we add the feature that best improves the model, until adding a new feature no longer improves its performance.
# install mlxtend to use its sequential feature selector API
pip install mlxtend
# import the necessary libraries
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression
# Sequential Forward Selection (SFS)
sfs = SFS(LinearRegression(),  # base estimator for feature selection (here: regression)
          k_features=11,       # number of features to select
          forward=True,        # False for backward elimination
          floating=False,      # plain sequential selection (no floating step)
          scoring='r2',        # regression can be scored with R2
          cv=0)                # 0 = no cross-validation
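scikit-learn also ships an equivalent SequentialFeatureSelector. A minimal, runnable forward-selection sketch (the diabetes dataset and n_features_to_select=5 are illustrative choices, not values from this note):

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)

# greedy forward selection: start empty, add the best feature each round
sfs = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=5,  # illustrative; this is the hyperparameter you choose
    direction="forward",
    scoring="r2",
    cv=5,
)
sfs.fit(X, y)

print("selected feature mask:", sfs.get_support())
print("reduced matrix shape:", sfs.transform(X).shape)
```

After fitting, get_support() returns a boolean mask over the columns and transform() drops the unselected ones.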
Backward Elimination
In backward elimination, we start with all the features and remove the least significant feature at each iteration. We repeat this until removing a feature no longer improves the performance of the model.
# install mlxtend to use its sequential feature selector API
pip install mlxtend
# import the necessary libraries
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression
# Sequential Backward Selection (SBS)
sbs = SFS(LinearRegression(),  # base estimator for feature selection (here: regression)
          k_features=11,       # number of features to select
          forward=False,       # True for forward selection
          floating=False,      # plain sequential selection (no floating step)
          scoring='r2',        # regression can be scored with R2
          cv=0)                # 0 = no cross-validation
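With scikit-learn's SequentialFeatureSelector, the same backward elimination only needs direction="backward" (dataset and target size below are illustrative assumptions):

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)

# greedy backward elimination: start with all features, drop the weakest each round
sbs = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=5,  # illustrative target subset size
    direction="backward",
    scoring="r2",
    cv=5,
)
sbs.fit(X, y)

print("kept feature mask:", sbs.get_support())
```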
Recursive Feature Elimination
- See also: Finding Optimal n Using RFECV
This is a greedy optimization algorithm.
RFE works by recursively removing features from the original dataset and building a model using the remaining features. The goal of RFE is to identify the most important features in the dataset by repeatedly removing the least important features until the desired number of features is reached.
In Python, you can use the RFE class from the sklearn.feature_selection module to perform RFE.
Example with logistic regression as the base model:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
Separate df into a feature matrix and target array:
features = df[['LIST OF FEATURES']]
X = features
y = df['target']
Implement RFE:
model = LogisticRegression()  # create an instance of the base model
rfe = RFE(estimator=model, n_features_to_select=VALUE)  # VALUE = number of important features to keep
rfe = rfe.fit(X, y)
Sample output (printing rfe.n_features_, rfe.support_ and rfe.ranking_):
Number of Features: 3
Selected Features: [ True False False False False True True False]
Feature Ranking: [1 2 3 5 6 1 1 4]
The n selected features are marked True in the support mask and ranked 1 in the ranking array; higher ranks mean a feature was eliminated earlier.
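Putting the pieces together, here is an end-to-end RFE sketch on a synthetic dataset. The 8 features and 3 selected mirror the shape of the sample output above, but the dataset and all parameter values are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# synthetic classification data: 8 features, 3 of them informative
X, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                           random_state=0)

rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X, y)

print("Number of Features:", rfe.n_features_)
print("Selected Features:", rfe.support_)
print("Feature Ranking:", rfe.ranking_)
```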