Transforming Skewed Data
tags: #preprocessing/normality
Website: TowardsDataScience
Many statistical analyses and machine learning algorithms require the data to be normally distributed.
When data is skewed, the normality assumption is violated.
To circumvent this, we can transform the data to better fit a Normal Distribution.
This involves applying a mathematical function to the data to convert the original data into a more useful format for analysis or modelling (e.g., ANOVAs, Regression).
There are many techniques; however, the choice of transformation will depend on the specific nature of the data and the goals of the analysis.
1. Logarithmic Transformation
The logarithmic transformation is a useful technique for achieving normality in a dataset, particularly when the data is moderately skewed and does not have extreme outliers.
This helps to compress the large values and spread out the small ones, resulting in a more normal distribution.
Python Code:
```python
import numpy as np

# regular logarithmic function
log_target = np.log(df["Target"])

# log(1 + x): safe when values are zero or very small
log_target = np.log1p(df["Target"])
```
The np.log and np.log1p functions are similar, but np.log1p is preferred when the data may contain zeros or very small values, as it avoids returning negative infinity.
-
- The np.log1p function computes the natural logarithm of 1 plus the input, rather than of the input itself. For non-negative data, this ensures the argument passed to the logarithm is always at least 1.
- With np.log, if the input is very small (i.e., close to zero), the result can be a very large negative number: the natural log of zero is undefined, and as the input approaches zero, the result approaches negative infinity.
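As a quick check of this behaviour, here is a small sketch on hypothetical log-normally distributed data (the data and seed are purely illustrative), comparing sample skewness before and after the transform:

```python
import numpy as np
from scipy.stats import skew

# Hypothetical right-skewed sample (log-normal), for illustration only
rng = np.random.default_rng(42)
data = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

# np.log1p is safe even if zeros were present; here all values are positive
log_data = np.log1p(data)

print(f"skew before: {skew(data):.2f}")
print(f"skew after:  {skew(log_data):.2f}")
```

The transformed sample's skewness should be far closer to zero than the original's, which is the practical goal of the transformation.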
2. Square Root Transformation
This method takes the square root of each value in the data set.
The main advantage of the square root transformation is that it can be applied to zero values; however, it compresses large values only mildly, so extreme values and outliers remain influential.
To apply this in the context of a DataFrame in Python:
```python
import numpy as np

# Method 1: exponent notation
sqrt_target = df["Target"]**(1/2)

# Method 2: NumPy
df["sqrt_Target"] = np.sqrt(df["Target"])
```
The square-root transformation is often used for data that contains only positive values, as the square root of a negative number is undefined.
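To illustrate the zero-handling point, a sketch on hypothetical count-like data that contains zeros (the Poisson data and column names are illustrative):

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

# Hypothetical count data including zeros (Poisson), for illustration only
rng = np.random.default_rng(0)
df = pd.DataFrame({"Target": rng.poisson(lam=3, size=10_000).astype(float)})

# Unlike np.log, np.sqrt is defined at zero, so no shift is needed
df["sqrt_Target"] = np.sqrt(df["Target"])

print(f"skew before: {skew(df['Target']):.2f}")
print(f"skew after:  {skew(df['sqrt_Target']):.2f}")
```

Zeros map cleanly to zero, and the right skew of the counts is reduced, though less aggressively than a log transform would reduce it.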
3. Box-Cox Transformation
The Box-Cox transformation is a parametric transformation technique that can be used to normalize the distribution of a continuous variable.
However, when the data contains values close to 0, the Box-Cox transformation may not be suitable, and an alternative such as the Yeo-Johnson transformation, which is also more robust in handling negative values, may be more appropriate:
```python
from scipy.stats import yeojohnson

# apply Yeo-Johnson transformation; returns the data and the fitted lambda
df["transformed_target"], lam = yeojohnson(df["target"])

# passing lmbda explicitly returns only the transformed data
df["transformed_target"] = yeojohnson(df["target"], lmbda=lam)
```
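To illustrate the robustness to negative values, a small sketch on hypothetical skewed data shifted to include negatives (data and seed are illustrative):

```python
import numpy as np
import pandas as pd
from scipy.stats import yeojohnson, skew

# Hypothetical right-skewed data shifted so some values are negative
rng = np.random.default_rng(1)
df = pd.DataFrame({"target": rng.lognormal(size=5_000) - 0.5})

# Box-Cox would reject this data; Yeo-Johnson handles it directly
transformed, lam = yeojohnson(df["target"])
df["transformed_target"] = transformed

print(f"lambda: {lam:.2f}")
print(f"skew before: {skew(df['target']):.2f}, "
      f"after: {skew(df['transformed_target']):.2f}")
```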
Box-Cox involves raising the data to a power determined by a parameter, lambda (λ). The transformation can be defined as follows:

y(λ) = (y^λ − 1) / λ,  if λ ≠ 0
y(λ) = ln(y),          if λ = 0
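A minimal sketch of this piecewise definition ((y^λ − 1)/λ for λ ≠ 0, ln y for λ = 0), checked numerically against scipy.stats.boxcox; the function name and sample values are illustrative:

```python
import numpy as np
from scipy.stats import boxcox

def boxcox_manual(y, lam):
    """Box-Cox transform for strictly positive y."""
    y = np.asarray(y, dtype=float)
    if lam == 0:
        return np.log(y)
    return (y**lam - 1) / lam

y = np.array([0.5, 1.0, 2.0, 4.0, 8.0])

# scipy's boxcox with a fixed lmbda returns only the transformed data
assert np.allclose(boxcox_manual(y, 0.5), boxcox(y, lmbda=0.5))
assert np.allclose(boxcox_manual(y, 0.0), np.log(y))
```

Note that λ = 1 leaves the data unchanged apart from a shift of −1, which is why λ near 1 signals that little transformation is needed.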
The method is particularly useful when:
- the data has unequal variances, or the assumptions of normality and homogeneity of variance are not met;
- the response is non-negative (as the transform requires), though data with occasional zero or negative values can be handled by adding a constant α to the response before applying the power transformation.
Python Code:
To find the optimal lambda value, we can use maximum likelihood estimation (MLE) with boxcox_normmax[1]:
```python
from scipy.stats import boxcox, boxcox_normmax

# compute the optimal lambda via maximum likelihood;
# add 1 if the data contain zeros (Box-Cox requires strictly positive values)
opt_lambda = boxcox_normmax(df["Target"] + 1, method="mle")
```
Once we have found the optimal lambda value, we can set the lmbda parameter in the boxcox function:
```python
# transform data using Box-Cox with the fitted lambda
df["boxcox_target"] = boxcox(df["Target"] + 1, lmbda=opt_lambda)
```
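After modelling on the transformed scale, predictions can be mapped back to the original scale with scipy.special.inv_boxcox, remembering to undo any shift. A sketch assuming the same +1 shift as above (the exponential sample data is illustrative):

```python
import numpy as np
import pandas as pd
from scipy.stats import boxcox
from scipy.special import inv_boxcox

# Hypothetical positive target, for illustration only
rng = np.random.default_rng(7)
df = pd.DataFrame({"Target": rng.exponential(scale=2.0, size=1_000)})

# fit and transform in one call; boxcox also returns the fitted lambda
transformed, opt_lambda = boxcox(df["Target"] + 1)
df["boxcox_target"] = transformed

# invert: undo the power transform, then the +1 shift
recovered = inv_boxcox(df["boxcox_target"], opt_lambda) - 1
```

The round trip recovers the original values up to floating-point precision, which is worth verifying whenever a shift constant is involved.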
The scipy library in Python provides the boxcox_normmax function, which computes the optimal value of lambda for a given dataset. With method="mle", it evaluates the log-likelihood over a range of lambda values and returns the lambda that maximizes it. ↩︎