Wrapper-Methods
tags: #python/data_science/preprocessing/feature_selection
Wrapper-based methods select features by training a machine learning model with different subsets of features and evaluating the performance of the model on a validation set.
This is an iterative search. At each step:
- a candidate feature is added to (forward search) or removed from (backward search) the current subset, and
- the change is kept only if it improves the model's validation score.
The process repeats until no change improves the model, or until the desired number of features is reached.
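The loop above can be sketched from scratch as a greedy forward search. This is a minimal, illustrative sketch (the diabetes dataset, linear regression, and R2 scoring are assumptions, not part of the original note):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)
model = LinearRegression()

selected = []                        # current feature subset (column indices)
remaining = list(range(X.shape[1]))  # candidates not yet in the subset
best_score = -float("inf")

while remaining:
    # score each candidate feature when added to the current subset
    scores = {
        f: cross_val_score(model, X[:, selected + [f]], y, cv=5, scoring="r2").mean()
        for f in remaining
    }
    f_best = max(scores, key=scores.get)
    if scores[f_best] <= best_score:  # no candidate improves the model: stop
        break
    best_score = scores[f_best]
    selected.append(f_best)
    remaining.remove(f_best)

print("selected feature indices:", selected)
print("cross-validated R2:", round(best_score, 3))
```

Swapping the add step for a remove step (start from all features, drop the one whose removal hurts least) gives the backward variant of the same loop.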
Techniques
Note: The number of features to be selected is a hyperparameter that needs to be chosen beforehand.
Forward Selection
Forward selection is an iterative method in which we start with no features in the model. In each iteration, we add the feature that best improves the model, until adding a new feature no longer improves its performance.
# install mlxtend to use its sequential feature selector API
pip install mlxtend
# import the necessary libraries
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression
# Sequential Forward Selection (SFS)
sfs = SFS(LinearRegression(),  # base estimator for feature selection (here: regression)
          k_features=11,       # number of features to select
          forward=True,        # False for backward elimination
          floating=False,      # plain sequential selection (no floating step)
          scoring='r2',        # regression can be scored with R2
          cv=0)                # 0 = no cross-validation
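scikit-learn also ships an equivalent SequentialFeatureSelector. A minimal, runnable forward-selection sketch (the diabetes dataset and n_features_to_select=5 are illustrative choices, not values from this note):

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)

# greedy forward selection: start empty, add the best feature each round
sfs = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=5,  # illustrative; this is the hyperparameter you choose
    direction="forward",
    scoring="r2",
    cv=5,
)
sfs.fit(X, y)

print("selected feature mask:", sfs.get_support())
print("reduced matrix shape:", sfs.transform(X).shape)
```

After fitting, get_support() returns a boolean mask over the columns and transform() drops the unselected ones.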
Backward Elimination
In backward elimination, we start with all the features and remove the least significant feature at each iteration. We repeat this until removing a feature no longer improves the performance of the model.
# install mlxtend to use its sequential feature selector API
pip install mlxtend
# import the necessary libraries
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression
# Sequential Backward Selection (SBS)
sbs = SFS(LinearRegression(),  # base estimator for feature selection (here: regression)
          k_features=11,       # number of features to select
          forward=False,       # True for forward selection
          floating=False,      # plain sequential selection (no floating step)
          scoring='r2',        # regression can be scored with R2
          cv=0)                # 0 = no cross-validation
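With scikit-learn's SequentialFeatureSelector, the same backward elimination only needs direction="backward" (dataset and target size below are illustrative assumptions):

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)

# greedy backward elimination: start with all features, drop the weakest each round
sbs = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=5,  # illustrative target subset size
    direction="backward",
    scoring="r2",
    cv=5,
)
sbs.fit(X, y)

print("kept feature mask:", sbs.get_support())
```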
Recursive Feature Elimination
- See also: Finding Optimal n Using RFECV
This is a greedy optimization algorithm.
RFE works by recursively removing features from the original dataset and building a model using the remaining features. The goal of RFE is to identify the most important features in the dataset by repeatedly removing the least important features until the desired number of features is reached.
In Python, you can use the RFE class from the sklearn.feature_selection module to perform RFE.
Example with logistic regression as the base model:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
Separate df into a feature matrix and target array:
features = df[['LIST OF FEATURES']]
X = features
y = df['target']
Implement RFE:
model = LogisticRegression()  # create an instance of the base model
rfe = RFE(estimator=model, n_features_to_select=VALUE)  # VALUE = number of important features to keep
rfe = rfe.fit(X, y)
Sample output (printing rfe.n_features_, rfe.support_ and rfe.ranking_):
Number of Features: 3
Selected Features: [ True False False False False True True False]
Feature Ranking: [1 2 3 5 6 1 1 4]
The n selected features are marked True in the support mask and ranked 1 in the ranking array; higher ranks mean a feature was eliminated earlier.
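Putting the pieces together, here is an end-to-end RFE sketch on a synthetic dataset. The 8 features and 3 selected mirror the shape of the sample output above, but the dataset and all parameter values are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# synthetic classification data: 8 features, 3 of them informative
X, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                           random_state=0)

rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X, y)

print("Number of Features:", rfe.n_features_)
print("Selected Features:", rfe.support_)
print("Feature Ranking:", rfe.ranking_)
```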