Multicollinearity

What is multicollinearity?

Multicollinearity occurs when two or more predictors (independent variables, IDV) in a multiple regression model are highly linearly correlated.

This affects the estimated regression coefficients of the correlated covariates: their standard errors are inflated.

As a result, the coefficient estimates become unreliable (note: in cases of extreme, i.e. perfect, multicollinearity, the coefficients cannot be uniquely estimated and you may get an error).
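To illustrate the extreme case, here is a minimal sketch using synthetic data (the variables x1, x2_exact and x2_near are made up for illustration). It shows that a perfectly collinear predictor makes the design matrix rank-deficient, while a nearly collinear one keeps full rank but produces a large condition number, which is what makes the estimates unstable.

```python
import numpy as np

# Synthetic data for illustration only
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2_exact = 2.0 * x1                                   # perfectly collinear with x1
x2_near = 2.0 * x1 + rng.normal(scale=0.01, size=n)   # almost collinear with x1

def design(x_a, x_b):
    """Design matrix with an intercept column and two predictors."""
    return np.column_stack([np.ones(len(x_a)), x_a, x_b])

X_exact = design(x1, x2_exact)
X_near = design(x1, x2_near)

# A rank below the number of columns means the coefficients are not unique
rank_exact = np.linalg.matrix_rank(X_exact)
rank_near = np.linalg.matrix_rank(X_near)

# A large condition number signals numerically unstable estimates
cond_near = np.linalg.cond(X_near)

print("rank with exact collinearity:", rank_exact)
print("rank with near collinearity:", rank_near)
print("condition number with near collinearity:", cond_near)
```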

Why is this important?

One of the assumptions of Linear Regression and Binary Logistic Regression models is that features are INDEPENDENT of each other. The presence of multicollinearity therefore VIOLATES the independence assumption, because it shows that features depend on each other.

Detecting multicollinearity

1. Using a correlation coefficient heatmap

We can inspect the pairwise correlation coefficients in a matrix and exclude one column from each pair that is highly correlated.

We can do this by using the df.corr() method (note: pass numeric_only=True so the correlation is computed for NUMERIC features only; recent pandas versions raise an error on non-numeric columns otherwise), and displaying the matrix in a heatmap using seaborn:

import seaborn as sns
sns.heatmap(df.corr(numeric_only=True), annot=True)
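For wide tables the heatmap can be hard to read, so the same check can be done programmatically. The sketch below is one possible approach, not a standard recipe: the helper high_correlation_pairs and the 0.8 cutoff are assumptions chosen for illustration, and the data is synthetic.

```python
import numpy as np
import pandas as pd

def high_correlation_pairs(df, threshold=0.8):
    """Return predictor pairs whose absolute correlation exceeds `threshold`.

    Hypothetical helper; the default threshold of 0.8 is an assumption.
    """
    corr = df.select_dtypes(include="number").corr().abs()
    # Keep only the upper triangle so each pair appears once and the
    # diagonal (self-correlation of 1.0) is excluded.
    upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper_mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Synthetic data for illustration only
rng = np.random.default_rng(0)
a = rng.normal(size=100)
demo = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=100),  # nearly a copy of "a"
    "c": rng.normal(size=100),                 # unrelated to "a" and "b"
})

flagged = high_correlation_pairs(demo)
print(flagged)
```

Each flagged pair is a candidate for dropping one of its two columns; which one to drop is a modelling decision.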

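2. Using Variance Inflation Factor (VIF)

VIF measures how much the variance of a coefficient estimate is inflated because its predictor is correlated with the other predictors.

Threshold: $VIF > 2.5$ (note: this corresponds to an $R^{2}$ of 0.6 when an IDV is regressed on the other IDVs, since $VIF = 1 / (1 - R^{2})$)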
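The link between the threshold and $R^{2}$ can be checked by hand. The sketch below, on synthetic data, computes VIF from its definition by regressing one predictor on the others with NumPy; the helper names vif_from_r_squared and manual_vif are made up for this example and are not part of statsmodels.

```python
import numpy as np

def vif_from_r_squared(r_squared):
    """VIF = 1 / (1 - R^2)."""
    return 1.0 / (1.0 - r_squared)

def manual_vif(X, j):
    """VIF of column j: regress it on the other columns plus an intercept."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r_squared = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return vif_from_r_squared(r_squared)

# An R^2 of 0.6 corresponds to the 2.5 threshold
print(vif_from_r_squared(0.6))

# Synthetic predictors for illustration only
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.5, size=n)  # correlated with x1
x3 = rng.normal(size=n)                  # independent of x1 and x2
X = np.column_stack([x1, x2, x3])

for j, name in enumerate(["x1", "x2", "x3"]):
    print(name, manual_vif(X, j))
```

In this setup x1 and x2 should land above the 2.5 threshold, while x3 should stay close to 1.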

#Imports
import pandas as pd
import numpy as np
from patsy import dmatrices
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv('CSV FILE')
df = df.dropna() #dropna() returns a new DataFrame, so assign the result
df = df.select_dtypes(include='number') #drop non-numeric cols

df.head()

To calculate the VIF for each explanatory variable in the model, we can use the variance_inflation_factor() function from the statsmodels library:

from patsy import dmatrices
from statsmodels.stats.outliers_influence import variance_inflation_factor

#find design matrix for linear regression model using 'DV' as response variable 
y, X = dmatrices('DV ~ IDV+IDV2+IDV3', data=df, return_type='dataframe')

#calculate VIF for each explanatory variable
vif = pd.DataFrame()
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['variable'] = X.columns

#view VIF for each explanatory variable (the 'Intercept' row added by dmatrices can be ignored)
vif