Winsorizing Outliers

tags: #data_transformation #outliers

Winsorizing is a data transformation technique that replaces extreme values in a dataset with the values at specified percentiles.

Running in Python

To Winsorize a dataset, we can use SciPy's scipy.stats.mstats.winsorize() function, which replaces values below and above the specified percentiles with the values at those percentiles:

import numpy as np
from scipy.stats.mstats import winsorize

# apply Winsorization at the 10th and 90th percentiles
# (df is assumed to be a pandas DataFrame with a numeric 'target' column)
winsorized_data = winsorize(df['target'], limits=[0.1, 0.1])

# view the transformed data
print(winsorized_data)

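The limits parameter is a pair of fractions (lower, upper), each between 0 and 1, giving the share of the data to replace in the lower tail and in the upper tail. Values that fall below the lower cut-off or above the upper cut-off are replaced with the value at that cut-off.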
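Because the two fractions are independent, the tails can be treated differently. The following is a minimal sketch on a made-up ten-value sample (the array and variable names are assumptions for illustration) that leaves the lower tail untouched and replaces only the top 20% of values:

```python
import numpy as np
from scipy.stats.mstats import winsorize

# Made-up sample with one large value in the upper tail
sample = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# limits=(lower_fraction, upper_fraction):
# 0 leaves the lower tail as-is, 0.2 replaces the top 20% (2 of 10 values)
upper_only = winsorize(sample, limits=(0, 0.2))

print(upper_only.tolist())  # [1, 2, 3, 4, 5, 6, 7, 8, 8, 8]
```

The two largest values (9 and 100) are replaced with 8, the largest value that remains, while the lower tail is unchanged.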

With limits=[0.1, 0.1], the lowest 10% and highest 10% of values are replaced, not removed. The resulting winsorized_data has the same shape as the original data, but the values in each tail are replaced with the values at the 10th and 90th percentiles.
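To make the 10%/10% case concrete, here is a small sketch on a made-up array of ten values (the data and variable names are assumptions for illustration). With ten values, 10% corresponds to one value at each end:

```python
import numpy as np
from scipy.stats.mstats import winsorize

# Made-up sample: 1 and 100 are the extreme values at each end
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Replace the lowest 10% and highest 10% (one value on each side)
capped = winsorize(values, limits=[0.1, 0.1])

print(capped.tolist())               # [2, 2, 3, 4, 5, 6, 7, 8, 9, 9]
print(capped.shape == values.shape)  # True: nothing is dropped
print(np.ma.isMaskedArray(capped))   # True: the result is a NumPy masked array
```

The smallest value becomes 2 and the largest becomes 9, so the array keeps its length. If a plain array is needed downstream, np.asarray(capped) should work when no values are masked.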


Finding the Optimal Percentiles: IQR Method

There is no single optimal choice of limits for Winsorization, but one common approach is to derive them from the interquartile range (IQR):

$$\text{IQR} = Q_3 - Q_1$$

The IQR is the difference between the 75th and 25th percentiles of the data.

We can define the limits for outliers as:

$$\text{Upper limit} = Q_3 + 1.5 \times \text{IQR}$$

$$\text{Lower limit} = Q_1 - 1.5 \times \text{IQR}$$

In Python:

# Calculate the 25th and 75th percentiles 
q1, q3 = np.percentile(data, [25, 75]) 

# Alternative method: np.quantile takes fractions (0-1) instead of percentages
q1 = np.quantile(data, 0.25)
q3 = np.quantile(data, 0.75)

# Calculate the interquartile range 
iqr = q3 - q1 

# Calculate the upper and lower limits 
upper_limit = q3 + 1.5 * iqr 
lower_limit = q1 - 1.5 * iqr 

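# Cap the data at the upper and lower limits.
# Note: winsorize() expects limits as fractions of the data to replace in each
# tail, not data values, so the IQR limits are applied with np.clip instead.
winsorized_data = np.clip(data, lower_limit, upper_limit)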
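The IQR steps can be sketched end to end on a small made-up sample. This sketch caps values at the limits with np.clip, since the limits argument of winsorize() is a pair of fractions rather than data values. The helper name cap_at_iqr_fences, its k parameter, and the sample data are assumptions for illustration:

```python
import numpy as np

def cap_at_iqr_fences(values, k=1.5):
    """Cap values at Q1 - k*IQR and Q3 + k*IQR (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower_limit = q1 - k * iqr
    upper_limit = q3 + k * iqr
    # Values outside the limits are set to the nearest limit
    capped = np.clip(values, lower_limit, upper_limit)
    return capped, float(lower_limit), float(upper_limit)

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
capped, lower_limit, upper_limit = cap_at_iqr_fences(sample)

print(lower_limit, upper_limit)  # -3.5 14.5
print(capped.tolist())           # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 14.5]
```

Here Q1 = 3.25 and Q3 = 7.75, so IQR = 4.5 and the limits are -3.5 and 14.5. Only the value 100 lies outside them, and it is capped at 14.5. Unlike percentile-based Winsorization, this replaces outliers with the limit itself rather than with an observed data value.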