Running Linear Regression

tags: #ML/supervised/regression

How to Run a Regression Model

To run a linear regression model with statsmodels.api:

import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

# read dataset
df = pd.read_csv('FILENAME.csv')

# split data into a feature matrix and a target vector
feature_cols = ["LIST OF COLNAMES"]
target_var = "TARGET COLNAME"  # placeholder: name of the column to predict
X = df[feature_cols]  # Features
X = sm.add_constant(X)  # add a column of ones so the model estimates an intercept
y = df[target_var]  # Target variable

# split into training and testing datasets
# test_size and random_state must be passed as keywords; the values here are examples
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# create an instance of the OLS model (note the order: target first, then features)
model = sm.OLS(y_train, X_train)

# fit model
result = model.fit()

# summary
print(result.summary())

The result.summary() method returns a detailed summary of the regression analysis. Example output for a model with a single predictor (x1):

                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.891
Model:                            OLS   Adj. R-squared:                  0.889
Method:                 Least Squares   F-statistic:                     403.5
Date:                Mon, 10 Jan 2024   Prob (F-statistic):           1.52e-24
Time:                        12:34:56   Log-Likelihood:                -142.54
No. Observations:                 100   AIC:                             289.1
Df Residuals:                      98   BIC:                             294.8
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.4963      0.569     -0.873      0.385      -1.624       0.631
x1             3.1216      0.155     20.086      0.000       2.813       3.430
==============================================================================
Omnibus:                        2.693   Durbin-Watson:                   2.147
Prob(Omnibus):                  0.260   Jarque-Bera (JB):                2.493
Skew:                           0.303   Prob(JB):                        0.288
Kurtosis:                       2.501   Cond. No.                         11.2
==============================================================================

Interpreting the Output

| Field | Description |
| --- | --- |
| Model | The type of regression model used (e.g., Ordinary Least Squares). |
| R-squared | The coefficient of determination: the proportion of variance in the dependent variable explained by the independent variable(s). |
| Adj. R-squared | R-squared adjusted for the number of predictors in the model. As more predictors are added, R-squared never decreases, even if the new predictors do not contribute meaningfully to explaining the variance in the dependent variable; the adjusted value penalizes each extra predictor to compensate. |
| F-statistic | Test statistic for the overall significance of the model. |
| Prob (F-statistic) | P-value associated with the F-statistic. |

How to Plot the Data and Regression Line

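import matplotlib.pyplot as plt

# Plot the training data and the fitted regression line (single-predictor model)
x_col = feature_cols[0]  # feature shown on the x-axis; X_train also holds the const column
plt.scatter(X_train[x_col], y_train, label='Data points')
plt.plot(X_train[x_col], result.fittedvalues, color='red', label='Linear Regression')
plt.xlabel(x_col)
plt.ylabel('y')
plt.legend()
plt.show()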