Filter-based Method
tags: #python/data_science/preprocessing/feature_selection
What are filter-based methods?
Filter-based feature selection is a type of feature selection technique that involves selecting features based on a predefined criterion, such as correlation, statistical tests, or information-theoretic[1] measures, without involving a machine learning model.
Caveats
Filter-based methods evaluate each feature independently and DO NOT consider relationships between features. As a result, they do not account for multicollinearity, which must be dealt with separately before applying filter-based feature selection.
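One simple way to screen for multicollinearity beforehand is to inspect pairwise correlations between features. A minimal sketch on toy data (the 0.9 cutoff is an arbitrary example, not a standard value):

```python
import pandas as pd

# toy feature matrix; "a" and "b" are collinear by construction
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],   # exactly 2 * "a"
    "c": [5, 3, 8, 1, 9],
})

# absolute pairwise correlations between features
corr = df.corr().abs()

# flag feature pairs above an (arbitrary) 0.9 cutoff, skipping self-pairs
high = [
    (i, j, corr.loc[i, j])
    for i in corr.columns for j in corr.columns
    if i < j and corr.loc[i, j] > 0.9
]
print(high)  # flags the ("a", "b") pair
```

Flagged pairs can then be reduced (e.g., drop one feature of each pair) before running the filter-based selection itself.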
Techniques
1. Univariate Feature Selection: sklearn SelectKBest
Univariate feature selection works by selecting the best features based on univariate statistical tests.
This can be done with sklearn SelectKBest, which removes all but the k highest scoring features.
Note: this only works for supervised ML algorithms, because it evaluates the relationship between each input feature and the target variable and selects the k best features according to a scoring function.
In other words, it combines a univariate statistical test with selection of the top k features, ranked by the test statistic computed between each column of the feature matrix and the target variable.
Example: We can perform a chi-squared (chi2) test and keep the top k features:
```python
# divide dataset into a feature matrix and a single vector containing the target
features = df[['LIST OF FEATURES']]
X = features
y = df['target']

# import the necessary sklearn modules
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# k is the number of top features you want to keep
X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
X_new.shape
```
`X_new` contains only the two best features selected by the chi2 score.
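Note that `fit_transform()` returns a plain NumPy array, so the column names are lost. To see which features were kept, fit the selector and call `get_support()`. A small sketch on made-up data (feature names and values are illustrative; chi2 requires non-negative features):

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# toy non-negative feature matrix and binary target
X = pd.DataFrame({
    "f1": [1, 2, 3, 4],
    "f2": [0, 0, 1, 1],
    "f3": [5, 1, 4, 2],
})
y = [0, 0, 1, 1]

selector = SelectKBest(chi2, k=2).fit(X, y)

# get_support() returns a boolean mask over columns; map it back to names
kept = X.columns[selector.get_support()]
print(list(kept))  # -> ['f1', 'f2']
```

This makes it easy to keep working with a labeled DataFrame instead of an anonymous array after selection.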
2. Univariate Feature Selection: sklearn VarianceThreshold
VarianceThreshold removes constant features that provide minimal or no information. A feature that does not vary much within itself generally has very little predictive power.
Constant features show the same (or nearly the same) value across all observations in the dataset.
These features provide no information that allows ML models to predict the target.
- High variance in a predictor: good indication
- Low variance in a predictor: not good for the model
- Features must be numerical in nature
- If working with categorical features, convert them into a numerical representation first. This can be done through ordinal encoding.
- Normalize the data to a common scale, since variance is scale-dependent
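The categorical-to-numerical step above can be done with sklearn's `OrdinalEncoder`. A sketch on a made-up frame (column names and values are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# toy frame with one categorical column
df = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],
    "price": [10.0, 30.0, 20.0, 12.0],
})

# encode categories as integers so VarianceThreshold can process them;
# OrdinalEncoder assigns codes in sorted order: large=0, medium=1, small=2
enc = OrdinalEncoder()
df["size"] = enc.fit_transform(df[["size"]])
print(df["size"].tolist())  # -> [2.0, 0.0, 1.0, 2.0]
```

Keep in mind that ordinal codes impose an artificial ordering; for nominal categories this is acceptable here only because VarianceThreshold merely measures spread, not order.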
- Import the necessary modules:
```python
from sklearn.feature_selection import VarianceThreshold
```
- Instantiate the `VarianceThreshold` class and set the threshold value:
```python
selector = VarianceThreshold(threshold=0.5)
# The `threshold` parameter specifies the minimum variance that a feature
# must have to be retained. In this example, any feature with a variance
# below 0.5 will be removed.
```
- Fit the selector to the input data:
```python
# values cannot be NULL
X = X.fillna(0)

# fit the selector and transform the input data
X_selected = selector.fit_transform(X)
# fit_transform() applies the variance-threshold selection to the input data
# and returns the selected features as a NumPy array; any features with a
# variance below the threshold value are removed.

# print the resulting array
print(X_selected)
```
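Putting the steps together — scaling, thresholding, and mapping the kept columns back to their names — can be sketched as follows (the data, column names, and 0.05 threshold are made up for illustration):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import VarianceThreshold

df = pd.DataFrame({
    "constant":  [1, 1, 1, 1, 1],   # zero variance -> should be dropped
    "varied":    [1, 5, 2, 9, 4],
    "binaryish": [0, 1, 0, 1, 1],
})

# scale features to [0, 1] so the variance threshold is comparable across columns
X = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)

selector = VarianceThreshold(threshold=0.05)
X_sel = selector.fit_transform(X)

# get_support() maps the retained columns back to their names
kept = df.columns[selector.get_support()]
print(list(kept))  # -> ['varied', 'binaryish']
```

Scaling first matters: without it, a feature measured in large units could pass the threshold on raw magnitude alone, while an informative feature in small units could be dropped.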
e.g., entropy ↩︎