Filter-based Method
tags: #python/data_science/preprocessing/feature_selection
What are filter-based methods?
Filter-based feature selection is a type of feature selection technique that involves selecting features based on a predefined criterion, such as correlation, statistical tests, or information-theoretic[1] measures, without involving a machine learning model.
Caveats
Filter-based methods evaluate each feature independently and DO NOT consider relationships between features. As a result, they do not account for multicollinearity, which must be dealt with separately before applying filter-based feature selection.
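One simple way to screen for multicollinearity beforehand is to inspect pairwise correlations between features. A minimal sketch on toy data (the 0.9 cutoff is an arbitrary example, not a standard value):

```python
import pandas as pd

# toy feature matrix; "a" and "b" are collinear by construction
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],   # exactly 2 * "a"
    "c": [5, 3, 8, 1, 9],
})

# absolute pairwise correlations between features
corr = df.corr().abs()

# flag feature pairs above an (arbitrary) 0.9 cutoff, skipping self-pairs
high = [
    (i, j, corr.loc[i, j])
    for i in corr.columns for j in corr.columns
    if i < j and corr.loc[i, j] > 0.9
]
print(high)  # flags the ("a", "b") pair
```

Flagged pairs can then be reduced (e.g., drop one feature of each pair) before running the filter-based selection itself.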
Techniques
1. Univariate Feature Selection: sklearn SelectKBest
Univariate feature selection works by selecting the best features based on univariate statistical tests.
This can be done with sklearn SelectKBest, which removes all but the k highest scoring features.
Note: this only works for supervised ML algorithms, because it evaluates the relationship between each input feature and the target variable and selects the k best features according to a scoring function.
In other words, it combines a univariate statistical test with selection of the top k features, ranked by the test statistic computed between each column of the feature matrix and the target variable.
Example: We can perform a chi-squared (chi2) test and keep the top k features:
```python
# divide dataset into a feature matrix and a single vector containing the target
features = df[['LIST OF FEATURES']]
X = features
y = df['target']

# import the necessary sklearn modules
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# k is the number of top features you want to keep
X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
X_new.shape
```
`X_new` contains only the two best features selected by the chi2 score.
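Note that `fit_transform()` returns a plain NumPy array, so the column names are lost. To see which features were kept, fit the selector and call `get_support()`. A small sketch on made-up data (feature names and values are illustrative; chi2 requires non-negative features):

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# toy non-negative feature matrix and binary target
X = pd.DataFrame({
    "f1": [1, 2, 3, 4],
    "f2": [0, 0, 1, 1],
    "f3": [5, 1, 4, 2],
})
y = [0, 0, 1, 1]

selector = SelectKBest(chi2, k=2).fit(X, y)

# get_support() returns a boolean mask over columns; map it back to names
kept = X.columns[selector.get_support()]
print(list(kept))  # -> ['f1', 'f2']
```

This makes it easy to keep working with a labeled DataFrame instead of an anonymous array after selection.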
2. Univariate Feature Selection: sklearn VarianceThreshold
VarianceThreshold removes constant features that provide minimal or no information. A feature that does not vary much within itself generally has very little predictive power.
Constant features show the same (or nearly the same) value across all observations in the dataset.
These features provide no information that allows ML models to predict the target.
- High variance in a predictor: good indication
- Low variance in a predictor: not good for the model
- Features must be numerical in nature
- If working with categorical features, convert them into a numerical representation first. This can be done through ordinal encoding.
- Normalize the data to a common scale, since variance is scale-dependent
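The categorical-to-numerical step above can be done with sklearn's `OrdinalEncoder`. A sketch on a made-up frame (column names and values are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# toy frame with one categorical column
df = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],
    "price": [10.0, 30.0, 20.0, 12.0],
})

# encode categories as integers so VarianceThreshold can process them;
# OrdinalEncoder assigns codes in sorted order: large=0, medium=1, small=2
enc = OrdinalEncoder()
df["size"] = enc.fit_transform(df[["size"]])
print(df["size"].tolist())  # -> [2.0, 0.0, 1.0, 2.0]
```

Keep in mind that ordinal codes impose an artificial ordering; for nominal categories this is acceptable here only because VarianceThreshold merely measures spread, not order.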
- Import the necessary modules:
```python
from sklearn.feature_selection import VarianceThreshold
```
- Instantiate the `VarianceThreshold` class and set the threshold value:
```python
selector = VarianceThreshold(threshold=0.5)
# The `threshold` parameter specifies the minimum variance that a feature
# must have to be retained. In this example, any feature with a variance
# below 0.5 will be removed.
```
- Fit the selector to the input data:
```python
# values cannot be NULL
X = X.fillna(0)

# fit the selector and transform the input data
X_selected = selector.fit_transform(X)
# fit_transform() applies the variance-threshold selection to the input data
# and returns the selected features as a NumPy array; any features with a
# variance below the threshold value are removed.

# print the resulting array
print(X_selected)
```
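Putting the steps together — scaling, thresholding, and mapping the kept columns back to their names — can be sketched as follows (the data, column names, and 0.05 threshold are made up for illustration):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import VarianceThreshold

df = pd.DataFrame({
    "constant":  [1, 1, 1, 1, 1],   # zero variance -> should be dropped
    "varied":    [1, 5, 2, 9, 4],
    "binaryish": [0, 1, 0, 1, 1],
})

# scale features to [0, 1] so the variance threshold is comparable across columns
X = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)

selector = VarianceThreshold(threshold=0.05)
X_sel = selector.fit_transform(X)

# get_support() maps the retained columns back to their names
kept = df.columns[selector.get_support()]
print(list(kept))  # -> ['varied', 'binaryish']
```

Scaling first matters: without it, a feature measured in large units could pass the threshold on raw magnitude alone, while an informative feature in small units could be dropped.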
e.g., entropy ↩︎