Running Linear Regression with sklearn

tags: #ML/supervised/regression

Step 1: Choose a class of model

In sklearn, every class of model is represented by a Python class.

To compute a simple linear regression model, we can import the linear regression class:

from sklearn.linear_model import LinearRegression

Note: The class sklearn.linear_model.LinearRegression itself fits a linear model, but it can also be used for polynomial regression (for nonlinear relationships) by first transforming the features, e.g. with sklearn.preprocessing.PolynomialFeatures, and making predictions accordingly.
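A minimal sketch of that idea, on a made-up exactly quadratic dataset (the data here is purely illustrative): expanding the single feature x into [x, x²] lets LinearRegression capture the nonlinear relationship, since the model is still linear in the expanded features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data: y = x^2, which a plain linear fit cannot capture
X = np.arange(6).reshape(-1, 1).astype(float)
y = (X ** 2).ravel()

# Expand [x] into [x, x^2]; the model is still linear in these columns
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)

model = LinearRegression().fit(X_poly, y)
print(model.score(X_poly, y))  # R^2 of 1.0 on this exact quadratic
```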

Step 2: Create an instance of the class and choose model hyperparameters

In sklearn, hyperparameters are chosen by passing values at model instantiation.

For example, we can instantiate the LinearRegression class and specify that we would like to fit the intercept using the fit_intercept hyperparameter (True by default):

model = LinearRegression(fit_intercept=True)

Step 3: Arrange data into a features matrix and target vector, split data

Recall: the data representation in sklearn requires a two-dimensional features matrix and a one-dimensional target array.

# split dataset into features and target variable
feature_cols = ["LIST OF COLNAMES"]
X = df[feature_cols]     # features matrix
y = df["target_var"]     # target vector

Once we have arranged the data into a feature matrix and target vector, we can partition the data into a training and test set:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

Note: for classification problems with imbalanced classes, you can additionally pass stratify=y; with a continuous regression target, stratify=y will raise an error.
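As a quick sanity check on a synthetic dataset (the array contents here are illustrative): with test_size=0.3, roughly 30% of the rows end up in the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 samples, 2 features
y = np.arange(10, dtype=float)     # continuous target: no stratify here

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)
print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)
```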

Step 4: Fit Model to the Data

We can then train the model on the training set using the .fit() method:

model.fit(X_train, y_train)

This computes the optimal values of the weights 𝑏₀ and 𝑏₁ (the intercept and coefficients), using the training inputs and outputs, X_train and y_train, as the arguments.

.fit()

fit loads the data into the model (estimator) and performs complex calculations behind the scenes that 'learn' from the data to train the model.

Results of these computations are stored in model-specific attributes that the user can explore:

# to get regression coefficients
model.coef_

# to get regression intercept (expected response when all predictors are 0)
model.intercept_

Note: By convention all model parameters that were learned during the fit() process have trailing underscores
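For example, on a made-up dataset that is exactly linear (y = 1 + 2x), the learned attributes recover the true parameters:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])  # features matrix, shape (4, 1)
y = 1 + 2 * X.ravel()                        # target vector: y = 1 + 2x

model = LinearRegression(fit_intercept=True).fit(X, y)
print(model.coef_)       # close to [2.]
print(model.intercept_)  # close to 1.0
```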

To obtain the coefficient of determination (R-squared) for model performance:

model.score(X_test, y_test)  # R² on the held-out test set

When applying .score(), the arguments are a features matrix X and the corresponding response y, and the return value is 𝑅².
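.score() is equivalent to computing 𝑅² = 1 − SSres/SStot from the model's predictions. A sketch verifying that equivalence on a small made-up dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.array([0.1, 1.9, 4.2, 5.8, 8.1])  # roughly y = 2x, with noise

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - np.mean(y)) ** 2)
r2 = 1 - ss_res / ss_tot

print(np.isclose(model.score(X, y), r2))  # True
```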

Step 5: Predictions

Once you have fitted the LinearRegression model using the fit() method, you can use the predict() method to generate predicted values for new data points:

y_pred = model.predict(X_test) # returns a 1-D array

Results are stored in an array which we can print out using the print() function.

Alternatively, we can compute the predicted output manually (assuming numpy has been imported as np):

y_pred = model.intercept_ + np.dot(X_train, model.coef_)

The np.dot() function is used to compute the dot product between the matrix X_train and the coefficient vector model.coef_, which yields a 1D array of predicted values. The intercept term is added to this array to get the final predicted values.

Note: X_train must be a matrix of shape (n_samples, n_features), where n_samples is the number of samples in the training set and n_features is the number of features.
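A quick check that the manual computation matches predict(), on a small made-up dataset with two features:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X_train = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5], [4.0, 3.0]])
y_train = np.array([3.0, 2.5, 4.5, 7.0])

model = LinearRegression().fit(X_train, y_train)

# intercept + dot product of the (n_samples, n_features) matrix
# with the (n_features,) coefficient vector
manual = model.intercept_ + np.dot(X_train, model.coef_)

print(np.allclose(manual, model.predict(X_train)))  # True
```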
