# Running Linear Regression
tags: #ML/supervised/regression
## How to Run a Regression Model

To run a linear regression model with `statsmodels.api`:
```python
import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

# read dataset
df = pd.read_csv('FILENAME.csv')

# split data into a matrix of features and a target vector
feature_cols = ["LIST OF COLNAMES"]
X = df[feature_cols]    # features
X = sm.add_constant(X)  # add a constant column to account for the intercept
y = df[target_var]      # target variable (target_var holds the column name)

# split into training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)  # example split settings

# create an instance of the OLS model
model = sm.OLS(y_train, X_train)

# fit the model
result = model.fit()

# print the summary
print(result.summary())
```
The `result.summary()` method provides a detailed summary of the regression analysis, including:
- Coefficients
- P-values
- Goodness of Fit statistics
```
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.891
Model:                            OLS   Adj. R-squared:                  0.889
Method:                 Least Squares   F-statistic:                     403.5
Date:                Mon, 10 Jan 2024   Prob (F-statistic):           1.52e-24
Time:                        12:34:56   Log-Likelihood:                -142.54
No. Observations:                 100   AIC:                             289.1
Df Residuals:                      98   BIC:                             294.8
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.4963      0.569     -0.873      0.385      -1.624       0.631
x1             3.1216      0.155     20.086      0.000       2.813       3.430
==============================================================================
Omnibus:                        2.693   Durbin-Watson:                   2.147
Prob(Omnibus):                  0.260   Jarque-Bera (JB):                2.493
Skew:                           0.303   Prob(JB):                        0.288
Kurtosis:                       2.501   Cond. No.                         11.2
==============================================================================
```
## Interpreting the Output
| Term | Description | Note |
|---|---|---|
| Model | The type of regression model used (e.g., Ordinary Least Squares). | |
| R-squared | The coefficient of determination: the proportion of variance in the dependent variable explained by the independent variable(s). | |
| Adj. R-squared | R-squared adjusted for the number of independent variables in the model. | As more predictors are added, R-squared tends to increase even if the new predictors do not contribute meaningfully to explaining the variance in the dependent variable; the adjusted value penalizes extra predictors. |
| F-statistic | Tests the overall significance of the model. | |
| Prob (F-statistic) | The p-value associated with the F-statistic. | |
## How to Plot the Data and Regression Line

```python
import matplotlib.pyplot as plt

# Plot the data and the fitted regression line (assumes a single feature)
feature = X_train.iloc[:, 1]  # skip the constant column added earlier
plt.scatter(feature, y_train, label='Data points')
plt.plot(feature, result.fittedvalues, color='red', label='Linear Regression')
plt.xlabel('X')
plt.ylabel('y')
plt.legend()
plt.show()
```