Transforming Skewed Data

tags: #preprocessing/normality

Website: TowardsDataScience

Many statistical analyses and machine learning algorithms assume, or perform better when, the data are approximately normally distributed.

Transforming skewed data involves applying a mathematical function to convert the original values into a form better suited to analysis or modelling (e.g., ANOVA, regression).

There are many techniques; however, the choice of transformation will depend on the specific nature of the data and the goals of the analysis.
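
As a rough illustration of what "the specific nature of the data" can mean in practice, the sketch below summarises a column's skewness and whether it contains zeros or negative values. The helper name `describe_for_transform` and the synthetic data are assumptions made for this example, not part of any library.

```python
import numpy as np
import pandas as pd


def describe_for_transform(series: pd.Series) -> dict:
    """Hypothetical helper: summarise properties that affect the choice of transformation."""
    return {
        "skewness": float(series.skew()),          # > 0 means right-skewed
        "has_zeros": bool((series == 0).any()),    # matters for log and Box-Cox
        "has_negatives": bool((series < 0).any()), # matters for log, sqrt and Box-Cox
    }


# Synthetic right-skewed data, used only for illustration.
rng = np.random.default_rng(0)
df = pd.DataFrame({"Target": rng.exponential(scale=2.0, size=1_000)})

summary = describe_for_transform(df["Target"])
print(summary)
```

A clearly positive skewness together with no negative values would point towards the logarithmic, square root or Box-Cox transformations described below.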


1. Logarithmic Transformation

The logarithmic transformation is a useful technique for achieving normality in a dataset, particularly when the data is moderately skewed and does not have extreme outliers.

Taking the logarithm compresses large values and spreads out small ones, which pulls a right-skewed distribution closer to normal.

Python Code:
import numpy as np

# natural logarithm; requires strictly positive values
log_target = np.log(df["Target"])

# log(1 + x); handles zeros and is more accurate for values near zero
log_target = np.log1p(df["Target"])
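
The difference between the two calls shows up as soon as the column contains a zero. The following is a minimal sketch on a made-up five-value column, not a recommendation for any particular dataset:

```python
import numpy as np
import pandas as pd

# Made-up values spanning several orders of magnitude, including a zero.
df = pd.DataFrame({"Target": [0.0, 1.0, 10.0, 100.0, 1000.0]})

# np.log(0) evaluates to -inf; silence the divide-by-zero warning for the demo.
with np.errstate(divide="ignore"):
    plain_log = np.log(df["Target"])

# np.log1p computes log(1 + x), so a zero maps to zero.
shifted_log = np.log1p(df["Target"])

print(plain_log.tolist())
print(shifted_log.tolist())
```

The first list starts with negative infinity, which most models cannot handle, while the `log1p` result stays finite for every entry.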

2. Square Root Transformation

This method takes the square root of each value in the data set.

The main advantage of the square root transformation is that it can be applied to zero values. However, it compresses large values less strongly than the logarithm, so it does little to reduce the influence of extreme values or outliers.

To apply this in the context of a DataFrame in Python:

# Method 1
sqrt_target = df["Target"]**(1/2)

# Method 2
import numpy as np
df['sqrt_Target'] = np.sqrt(df['Target'])

Negative Values

The square-root transformation should only be applied to non-negative values, since the square root of a negative number is not a real number; np.sqrt returns NaN for negative entries.
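
A small sketch of this behaviour follows. Shifting the column so that its minimum becomes zero is one assumed workaround, shown here only for illustration; it changes the interpretation of the values, and whether it is appropriate depends on the data.

```python
import numpy as np
import pandas as pd

# Made-up column containing one negative value.
df = pd.DataFrame({"Target": [-4.0, 0.0, 9.0, 16.0]})

# The negative entry becomes NaN; silence the invalid-value warning for the demo.
with np.errstate(invalid="ignore"):
    raw_sqrt = np.sqrt(df["Target"])

# Assumed workaround: shift the data so the smallest value is exactly zero.
shift = -df["Target"].min()
shifted_sqrt = np.sqrt(df["Target"] + shift)

print(raw_sqrt.tolist())
print(shifted_sqrt.tolist())
```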


3. Box-Cox Transformation

The Box-Cox transformation is a parametric transformation technique that can be used to normalize the distribution of a continuous variable.

However, Box-Cox requires strictly positive values. When the data contain zeros or negative values, an alternative such as the Yeo-Johnson transformation is more appropriate, since it is defined for all real values:

from scipy.stats import yeojohnson

# apply Yeo-Johnson transformation; the fitted lambda is returned as well
df["transformed_target"], lam = yeojohnson(df["Target"])

# discard the lambda value if it is not needed
df["transformed_target"], _ = yeojohnson(df["Target"])
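
To see Yeo-Johnson working on a column with negative values, here is a self-contained sketch on synthetic data. The shifted exponential sample and the column name `yj_target` are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd
from scipy.stats import yeojohnson

# Synthetic right-skewed sample, shifted so that it includes negative values.
rng = np.random.default_rng(0)
df = pd.DataFrame({"Target": rng.exponential(scale=2.0, size=2_000) - 1.0})

# With no lambda supplied, yeojohnson estimates it and returns it with the data.
transformed, lam = yeojohnson(df["Target"].to_numpy())
df["yj_target"] = transformed

print(f"fitted lambda: {lam:.3f}")
print(f"skewness before: {df['Target'].skew():.3f}")
print(f"skewness after:  {df['yj_target'].skew():.3f}")
```

Comparing the two skewness values gives a quick check of whether the transformation moved the distribution closer to symmetry.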

Box-Cox involves raising the data to a power determined by a parameter, lambda (λ). The transformation is defined as follows:

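$$
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \text{if } \lambda \neq 0 \\
\log(y), & \text{if } \lambda = 0
\end{cases}
$$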
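
The piecewise definition can be checked numerically against SciPy. The function `boxcox_manual` below is a hypothetical helper written for this comparison; it is a direct transcription of the two cases, not how SciPy implements the transformation internally.

```python
import numpy as np
from scipy.stats import boxcox


def boxcox_manual(y, lam):
    """Hypothetical helper: apply the two-case Box-Cox definition directly."""
    y = np.asarray(y, dtype=float)
    if lam == 0:
        return np.log(y)              # the lambda = 0 case
    return (y**lam - 1.0) / lam       # the lambda != 0 case


# Made-up strictly positive values.
y = np.array([0.5, 1.0, 2.0, 4.0, 8.0])

for lam in (0.0, 0.5, 2.0):
    manual = boxcox_manual(y, lam)
    reference = boxcox(y, lmbda=lam)  # a fixed lambda returns only the data
    print(lam, np.allclose(manual, reference))
```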

The method is particularly useful when the data has:

Python Code:

To find the optimal lambda value, we can use maximum likelihood estimation (MLE) with boxcox_normmax[1]:

from scipy.stats import boxcox, boxcox_normmax

# Compute optimal lambda by MLE (the default method is "pearsonr")
opt_lambda = boxcox_normmax(df["Target"] + 1, method="mle")
# +1 shifts the data when it contains zeros, since Box-Cox needs strictly positive values

Once we have found the optimal lambda value, we can set the lmbda parameter in the boxcox function:

# Transform data using Box-Cox
df["boxcox_target"] = boxcox(df["Target"] + 1, lmbda=opt_lambda) 
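
If predictions or summaries need to be reported on the original scale, the transformation can be undone. The sketch below assumes synthetic data and uses `scipy.special.inv_boxcox` for the inverse; the `+ 1` shift applied before the transformation has to be subtracted again afterwards.

```python
import numpy as np
import pandas as pd
from scipy.special import inv_boxcox
from scipy.stats import boxcox, boxcox_normmax

# Synthetic right-skewed, non-negative data, used only for illustration.
rng = np.random.default_rng(0)
df = pd.DataFrame({"Target": rng.exponential(scale=2.0, size=1_000)})

# Shift by 1 so that every value is strictly positive, then fit lambda by MLE.
shifted = df["Target"].to_numpy() + 1
opt_lambda = boxcox_normmax(shifted, method="mle")
df["boxcox_target"] = boxcox(shifted, lmbda=opt_lambda)

# Invert the transformation and remove the shift to recover the original values.
recovered = inv_boxcox(df["boxcox_target"].to_numpy(), opt_lambda) - 1

print(f"optimal lambda: {opt_lambda:.3f}")
print("round trip ok:", np.allclose(recovered, df["Target"].to_numpy()))
```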


  1. The scipy library in Python provides the boxcox_normmax function, which computes the optimal value of lambda for a given dataset. With method="mle", it returns the lambda that maximizes the Box-Cox log-likelihood function. Its default, method="pearsonr", instead returns the lambda that maximizes the Pearson correlation between the transformed data and the values expected under a normal distribution. ↩︎
