Running Logit in statsmodels

tags: #ML/supervised/classification/logit

Step 1. Import required packages

import statsmodels.api as sm

Step 2. Import required dataset

You can build the inputs and output the same way as you did with scikit-learn: split the dataset into a matrix of predictors (features) and a target vector containing the labels:

# split dataset into features and target variable
feature_cols = ["LIST OF COLNAMES"]
X = df[feature_cols]   # features
y = df["target_var"]   # target variable
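A minimal sketch of this split, using a hypothetical toy DataFrame (the column names `hours`, `score`, and `target_var` are made up for illustration; substitute your own):

```python
import pandas as pd

# Hypothetical toy dataset standing in for your real data
df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5, 6],
    "score": [50, 55, 60, 65, 70, 75],
    "target_var": [0, 0, 0, 1, 1, 1],
})

feature_cols = ["hours", "score"]
X = df[feature_cols]   # feature matrix, shape (6, 2)
y = df["target_var"]   # target vector, shape (6,)
```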
Caveat: Intercept

statsmodels doesn’t include the intercept 𝑏₀ by default. Therefore, you need to add a column of ones to your feature matrix.

You do that with add_constant():

X = sm.add_constant(X)
Why do we do this?

Unlike scikit-learn, statsmodels does not add an intercept term to a regression automatically, so you need to include it manually.

Step 3. Partition the Dataset

# split dataset (train_test_split comes from scikit-learn)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)  # example values

Step 4. Fit and Train Model

Your logistic regression model is going to be an instance of the class statsmodels.discrete.discrete_model.Logit.

To create an instance of the logit object:

>>> model = sm.Logit(y_train, X_train) # Note that the first argument here is `y`, followed by `X`.

To fit the model with existing data:

result = model.fit()

Step 5. Obtain Results

To print results output:

print(result.summary()) # or result.summary2()

You can obtain the values of 𝑏₀ and 𝑏₁ with .params:

>>> result.params
array([b0, b1]) # intercept and slope

Step 6. Evaluate the Model

You can evaluate the model by first generating its predicted probabilities on the test set with the .predict() method, then producing an accuracy report or confusion matrix:

>>> result.predict(X_test)

You can use their values to get the actual predicted outputs (classification):

>>> (result.predict(X_test) >= 0.5).astype(int)
array([0, 0, 0, 1, 1, 1, 1, 1, 1, 1])