Winsorizing Outliers
tags: #data_transformation #outliers
Winsorizing is a data transformation technique that replaces extreme values in the dataset with the values at specified percentiles.
Running in Python
To Winsorize a dataset, we can use SciPy's scipy.stats.mstats.winsorize() function, which replaces values below and above the specified percentiles with the values at those percentiles:
import numpy as np
from scipy.stats.mstats import winsorize
# apply Winsorization at the 10th and 90th percentiles
# (df is assumed to be an existing pandas DataFrame with a numeric 'target' column)
winsorized_data = winsorize(df['target'], limits=[0.1, 0.1])
# view the transformed data
print(winsorized_data)
The limits parameter specifies the fraction of values to replace at each end of the sorted data: the first number applies to the lowest values and the second to the highest. Values that fall below the lower percentile or above the upper percentile are replaced with the value at that percentile.
In the example above, we are replacing the lowest 10% and highest 10% of values. The resulting winsorized_data has the same shape as the original data, but with the outliers replaced by the values at the 10th and 90th percentiles.
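To make the effect of limits concrete, here is a minimal sketch on a small made-up array (the sample values are hypothetical and chosen only so the result is easy to check by hand). With ten observations and limits=[0.1, 0.1], exactly one value is replaced at each end:

```python
import numpy as np
from scipy.stats.mstats import winsorize

# Hypothetical sample: ten values with one extreme value at each end
sample = np.array([-50, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Replace the lowest 10% and the highest 10% of observations
# with the nearest remaining values (2 and 9 here)
result = winsorize(sample, limits=[0.1, 0.1])

print(result.tolist())  # [2, 2, 3, 4, 5, 6, 7, 8, 9, 9]
print(sample.tolist())  # the input is left unchanged by default
```

The function returns a NumPy masked array, so calling tolist() or np.asarray() on the result is a convenient way to get back to plain values.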
Finding the Optimal Percentiles: IQR Method
To find the optimal limits for Winsorization, one approach is to use the interquartile range (IQR):
The IQR is the difference between the 75th and 25th percentiles of the data.
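We can define the limits for outliers as Q1 - 1.5 × IQR (lower limit) and Q3 + 1.5 × IQR (upper limit), where Q1 and Q3 are the 25th and 75th percentiles.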
In Python:
# Calculate the 25th and 75th percentiles
q1, q3 = np.percentile(data, [25, 75])
# alternative method
q1 = np.quantile(data, 0.25)
q3 = np.quantile(data, 0.75)
# Calculate the interquartile range
iqr = q3 - q1
# Calculate the upper and lower limits
upper_limit = q3 + 1.5 * iqr
lower_limit = q1 - 1.5 * iqr
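# Cap the data at the upper and lower limits.
# Note: winsorize() expects `limits` as fractions of observations (e.g. 0.1),
# not data values, so np.clip is used for value-based limits.
winsorized_data = np.clip(data, lower_limit, upper_limit)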
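One point worth spelling out: winsorize() interprets limits as fractions of observations to replace at each end, while the IQR rule produces thresholds in the units of the data. The sketch below shows two ways to bridge that gap, assuming a small made-up array; the helper name iqr_limits and the sample values are hypothetical. Capping with np.clip sets outliers to the IQR thresholds themselves, whereas converting the thresholds to fractions and calling winsorize() sets them to the nearest non-outlying observation.

```python
import numpy as np
from scipy.stats.mstats import winsorize

def iqr_limits(values, k=1.5):
    """Return (lower, upper) outlier thresholds from the IQR rule (hypothetical helper)."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical sample with one high outlier
sample = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

lower, upper = iqr_limits(sample)  # -3.5 and 14.5 for this sample

# Option 1: cap values at the thresholds themselves
capped = np.clip(sample, lower, upper)

# Option 2: convert the thresholds to the fraction of observations
# outside each one, then pass those fractions to winsorize()
lower_frac = np.mean(sample < lower)
upper_frac = np.mean(sample > upper)
winsorized = winsorize(sample, limits=(lower_frac, upper_frac))

print(capped.tolist())      # 100 becomes 14.5
print(winsorized.tolist())  # 100 becomes 9.0
```

Because the fractions are converted to counts internally, floating-point rounding could in principle shift the count by one on other datasets, so it is worth checking the result when using the second option.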