Normalizing or Standardizing Features

tags: #python/data_science/preprocessing

What is it and why should we standardize features?

When working with continuous variables, it is often good practice to standardize them before fitting a model.

This is because variables measured on different scales do not contribute equally to model fitting: a feature with a large numeric range can dominate one with a small range, which can bias the fitted model toward that feature.
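
As a rough illustration of the scale problem, the sketch below computes Euclidean distances between three made-up samples. The numbers, the column meanings (age, income), and the `euclidean` helper are assumptions for illustration only, and distance is just one of several ways scale can matter.

```python
import math

# Hypothetical samples: (age in years, income in dollars).
a = (25, 50_000)
b = (60, 51_000)   # very different age, similar income
c = (26, 58_000)   # similar age, different income

def euclidean(p, q):
    """Plain Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

# On the raw scale the income column dominates the distance, so b ends up
# "closer" to a than c does, even though b is 35 years older.
dist_ab = euclidean(a, b)
dist_ac = euclidean(a, c)
print(dist_ab < dist_ac)
```

Here the 35-year age gap barely changes `dist_ab`, because the income differences are orders of magnitude larger.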

To deal with this potential problem, feature-wise standardization (μ = 0, σ = 1) is usually applied prior to model fitting.
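
Standardizing a single feature amounts to computing z = (x - μ) / σ for every value x in that column. A minimal sketch using only the standard library, with made-up values in `heights_cm` (a hypothetical name), might look like this:

```python
import statistics

# Hypothetical values for a single feature (one column of X).
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]

mu = statistics.fmean(heights_cm)      # column mean
sigma = statistics.pstdev(heights_cm)  # population standard deviation (divides by n)

# z = (x - mu) / sigma for every value in the column
z_scores = [(x - mu) / sigma for x in heights_cm]

print(statistics.fmean(z_scores))   # mean of the standardized column
print(statistics.pstdev(z_scores))  # standard deviation of the standardized column
```

The population standard deviation (`pstdev`) is used here because it corresponds to what scikit-learn's StandardScaler computes.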

To do that using scikit-learn, we first need to construct an input feature matrix X containing the samples (rows) and features (columns), with X.shape being (number_of_samples, number_of_features).

How do we standardize the feature matrix?

The main idea is to standardize each feature (column) of X individually, so that it has μ = 0 and σ = 1, before applying any machine learning model.

Thus, StandardScaler() will standardize the features, i.e. each column of X individually, so that each feature has μ = 0 and σ = 1.
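
To make the "each column individually" point concrete, the sketch below compares StandardScaler against a manual per-column z-score in NumPy. The matrix `X_demo` and the variable names are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: 4 samples, 2 features on very different scales.
X_demo = np.array([
    [1.0, 1000.0],
    [2.0, 2000.0],
    [3.0, 3000.0],
    [4.0, 5000.0],
])

scaler_demo = StandardScaler()
X_scaled = scaler_demo.fit_transform(X_demo)

# Manual per-column z-score; np.std defaults to ddof=0, which matches StandardScaler.
X_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(np.allclose(X_scaled, X_manual))
print(scaler_demo.mean_)   # per-column means learned during fit
print(scaler_demo.scale_)  # per-column standard deviations learned during fit
```

The fitted `mean_` and `scale_` attributes hold one value per column, which is what allows the same scaler to be reused on new data later.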

Import library

from sklearn.preprocessing import StandardScaler

Separate feature matrix from target

# df is an existing pandas DataFrame; replace the placeholders with real column names
features = df[['LIST OF FEATURE COLS']]
X = features
y = df['TARGET COL']

Standardize Features

# instantiate the object 
scaler = StandardScaler()

# fit and transform the data  
X_standardized = scaler.fit_transform(X)

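Alternatively, we can standardize after partitioning the dataset. In that case, fit the scaler on the training set only, then apply the same fitted scaler to both the training and test X datasets, so that no information from the test set leaks into the preprocessing:

# fit on the training data only, then transform it
X_train = scaler.fit_transform(X_train)
# reuse the training mean and standard deviation on the test data
X_test = scaler.transform(X_test)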
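
A common way to avoid leaking test-set statistics into preprocessing is to fit the scaler on the training split only and reuse it on the test split. The sketch below shows one way this could look end to end. The synthetic data and the names `X_all`, `y_all`, and `split_scaler` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 100 samples, 3 features, binary target.
rng = np.random.default_rng(0)
X_all = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y_all = rng.integers(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X_all, y_all, test_size=0.2, random_state=0
)

split_scaler = StandardScaler()
X_train_std = split_scaler.fit_transform(X_train)  # learn mean/std from training rows only
X_test_std = split_scaler.transform(X_test)        # reuse the training mean/std

# Training columns are standardized exactly; test columns only approximately.
print(X_train_std.mean(axis=0))
print(X_test_std.mean(axis=0))
```

The test-set means and standard deviations will generally be close to, but not exactly, 0 and 1, which is expected because the statistics come from the training rows.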

We can verify that we have successfully standardized the features by checking that the mean of each column is approximately 0 and the standard deviation is approximately 1:

# mean of each column
X_standardized.mean(axis=0)  # should be ~0 (up to floating-point error)

# standard deviation of each column
X_standardized.std(axis=0)  # should be ~1
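
One detail worth knowing when running this check: NumPy and pandas use different defaults for the standard deviation. The sketch below, on a small made-up DataFrame (`df_demo` is a hypothetical name), shows that the NumPy check gives 1 while the pandas default gives a slightly larger value unless `ddof=0` is passed.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical data frame with two numeric columns.
df_demo = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 50.0]})

scaled = StandardScaler().fit_transform(df_demo)  # returns a NumPy array by default
scaled_df = pd.DataFrame(scaled, columns=df_demo.columns)

# NumPy: population std (ddof=0) by default, so this is 1 up to rounding.
print(np.allclose(scaled.std(axis=0), 1.0))

# pandas: sample std (ddof=1) by default, so this is slightly above 1.
print(scaled_df.std())

# Passing ddof=0 brings pandas in line with StandardScaler.
print(np.allclose(scaled_df.std(ddof=0), 1.0))
```

So a pandas result slightly above 1 does not by itself mean the scaling failed. It may only reflect the `ddof` default.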

Can we standardize specific columns?

Yes, to standardize specific features, we can use the following process:

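# list of columns to scale (here: all int64 columns)
cols_to_scale = list(X.select_dtypes('int64'))

# create the scaler and fit it on the selected columns only
scaler = StandardScaler()
scaler.fit(X[cols_to_scale])

# work on a copy so the original DataFrame is not modified, then scale the selected columns
X = X.copy()
X[cols_to_scale] = scaler.transform(X[cols_to_scale])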
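
Another option for scaling only some columns is scikit-learn's ColumnTransformer, which applies a transformer to the listed columns and can pass the rest through unchanged. The following is a minimal sketch on a made-up DataFrame. The column names and the `X_mixed` variable are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical frame: two continuous columns and one binary flag left unscaled.
X_mixed = pd.DataFrame({
    "age": [20, 30, 40, 50],
    "income": [30_000, 45_000, 60_000, 90_000],
    "is_member": [0, 1, 1, 0],
})
cols_to_scale = ["age", "income"]

ct = ColumnTransformer(
    transformers=[("scale", StandardScaler(), cols_to_scale)],
    remainder="passthrough",  # keep the other columns unchanged
)

# Output is a NumPy array: scaled columns first, then the passthrough columns.
X_out = ct.fit_transform(X_mixed)
X_out_df = pd.DataFrame(X_out, columns=cols_to_scale + ["is_member"])
print(X_out_df)
```

Because the output puts the transformed columns first, the column order can differ from the input, so the labels are rebuilt explicitly here.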