Filter-based Method

tags: #python/data_science/preprocessing/feature_selection

What are filter-based methods?

Filter-based feature selection selects features according to a predefined criterion, such as correlation, statistical tests, or information-theoretic[1] measures, without training a machine learning model.
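To make the idea concrete, here is a minimal sketch of a hand-rolled filter that scores each feature by its absolute Pearson correlation with the target. The toy DataFrame, its column names, and the helper `rank_by_correlation` are all hypothetical and exist only for illustration; no model is fitted at any point.

```python
import pandas as pd

# Hypothetical toy data: two informative columns and one noisy column.
df = pd.DataFrame({
    "f1": [1, 2, 3, 4, 5, 6],
    "f2": [6, 5, 4, 3, 2, 1],
    "noise": [3, 1, 3, 1, 3, 1],
    "target": [1.1, 2.0, 2.9, 4.2, 5.1, 5.8],
})

def rank_by_correlation(frame, target_col):
    """Score each feature by its absolute Pearson correlation with the target."""
    features = frame.drop(columns=[target_col])
    scores = features.corrwith(frame[target_col]).abs()
    return scores.sort_values(ascending=False)

scores = rank_by_correlation(df, "target")
top_two = scores.index[:2].tolist()  # the two highest-scoring features
print(scores)
```

Note that `f1` and `f2` receive the same score here because one is a mirror image of the other, which already hints at the caveat about redundant features.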

Caveats

Important to Note!

Filter-based methods DO NOT consider the relationship BETWEEN features; each feature is scored on its own. They therefore do not account for multicollinearity, which must be dealt with separately before applying filter-based feature selection.
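One possible way to check for multicollinearity beforehand is to list the feature pairs whose absolute pairwise correlation exceeds a cutoff. The sketch below assumes a numeric pandas DataFrame; the helper name `highly_correlated_pairs`, the 0.9 cutoff, and the toy columns are assumptions made for illustration, not part of any library.

```python
import pandas as pd

def highly_correlated_pairs(X, threshold=0.9):
    """Return (feature_a, feature_b, |corr|) for every pair above the threshold."""
    corr = X.corr().abs()
    cols = corr.columns
    pairs = []
    # Only visit the upper triangle so each pair is reported once.
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            value = float(corr.iloc[i, j])
            if value > threshold:
                pairs.append((cols[i], cols[j], value))
    return pairs

# Hypothetical toy data: "b" is almost exactly twice "a".
X_toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10.5],
    "c": [5, 1, 4, 2, 3],
})

pairs = highly_correlated_pairs(X_toy, threshold=0.9)
print(pairs)
```

For each reported pair, you would typically keep one feature and drop the other before running a filter-based selector.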

Techniques

1. Univariate Feature Selection: sklearn SelectKBest

Univariate feature selection works by selecting the best features based on univariate statistical tests.

This can be done with sklearn SelectKBest, which removes all but the k highest scoring features.

Note: this only works for supervised ML algorithms, because scoring requires a target variable.

SelectKBest runs the chosen univariate statistical test between each input feature in the feature matrix and the target variable, then keeps the k features with the highest scores.

Example: we can apply a χ² test to the samples to retrieve only the two best features as follows (note that chi2 requires non-negative feature values):

# divide dataset into feature matrix and a single vector containing the target
features = df[['LIST OF FEATURES']]
X = features
y = df['target']

# import necessary sklearn modules
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

X_new = SelectKBest(chi2, k=2).fit_transform(X, y) # where k is the top n number of features you want
X_new.shape
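Because `fit_transform` returns a plain array, the column names of the selected features are lost. A minimal sketch of how to recover them with `get_support()` and inspect the scores via `scores_` follows; the toy data and column names are hypothetical.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical non-negative count features and a binary target.
X_demo = pd.DataFrame({
    "clicks": [10, 12, 11, 1, 2, 1],
    "visits": [5, 6, 5, 1, 0, 1],
    "flat": [3, 3, 3, 3, 3, 3],
})
y_demo = [1, 1, 1, 0, 0, 0]

selector = SelectKBest(chi2, k=2).fit(X_demo, y_demo)

# get_support() returns a boolean mask aligned with the input columns.
selected = X_demo.columns[selector.get_support()].tolist()

# scores_ holds the chi-squared statistic computed for each feature.
scores = dict(zip(X_demo.columns, selector.scores_))
print(selected)
print(scores)
```

The `flat` column is identical across both classes, so it scores lowest and is the one dropped.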

2. Univariate Feature Selection: sklearn VarianceThreshold

VarianceThreshold removes constant and near-constant features, which provide little or no information. A feature that barely varies generally has very little predictive power.

  1. Import the necessary modules:
from sklearn.feature_selection import VarianceThreshold
  2. Instantiate the VarianceThreshold class and set the threshold value:
selector = VarianceThreshold(threshold=0.5)

# The `threshold` parameter specifies the minimum variance that a feature must have to be retained. In this example, any feature with a variance below 0.5 will be removed.
  3. Fit the selector to the input data:
# Fill missing values before fitting (note that filling with 0 can change a feature's variance)
X = X.fillna(0)

# fit selector to input data
X_selected = selector.fit_transform(X)

# The `fit_transform()` method applies the variance threshold selection to the input data and returns the selected features as a numpy array. Any features with a variance below the threshold value will be removed.

# print numpy array
print(X_selected)
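The array returned by `fit_transform` carries no column names, so it can be unclear which features survived. A minimal sketch, using hypothetical toy columns, of how to list the retained features and inspect each feature's variance:

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data with one constant, one low-variance,
# and one high-variance feature.
X_demo = pd.DataFrame({
    "constant": [1, 1, 1, 1],
    "low_var": [0, 0, 0, 1],     # population variance 0.1875
    "high_var": [1, 5, 9, 13],   # population variance 20.0
})

selector = VarianceThreshold(threshold=0.5)
selector.fit(X_demo)

# Boolean mask over the input columns: True means the feature is kept.
kept = X_demo.columns[selector.get_support()].tolist()

# variances_ stores the variance computed for each input feature.
variances = dict(zip(X_demo.columns, selector.variances_))
print(kept)
print(variances)
```

Keep in mind that variance depends on a feature's scale, so a single threshold such as 0.5 is only meaningful when the features are on comparable scales.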


[1] e.g., entropy
