Running Linear Regression with sklearn
tags: #ML/supervised/regression
Step 1: Choose a class of model
In sklearn, every class of model is represented by a Python class.
To compute a simple linear regression model, we can import the linear regression class:
from sklearn.linear_model import LinearRegression
Note: LinearRegression always fits a model that is linear in its parameters. It can still capture nonlinear (e.g. polynomial) relationships, but only if you first expand the features (e.g. with sklearn.preprocessing.PolynomialFeatures) and then fit LinearRegression on the transformed features.
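A minimal sketch of that idea, using synthetic noise-free quadratic data (the variable names are illustrative, not from the original notes):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Noise-free quadratic data: y = 2 + 3x + 0.5x^2, so the fit is exact
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(50, 1))
y = 2 + 3 * x[:, 0] + 0.5 * x[:, 0] ** 2

# LinearRegression is linear in the parameters; the polynomial feature
# expansion is what lets it capture the curvature
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(x, y)

pred = poly_model.predict(np.array([[2.0]]))
print(pred)  # close to 2 + 3*2 + 0.5*4 = 10
```

Because the data contain no noise, the pipeline recovers the generating polynomial exactly.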
Step 2: Create an instance of the class and choose model hyperparameters
In sklearn, hyperparameters are chosen by passing values at model instantiation.
For example, we can instantiate the LinearRegression class and specify whether to fit the intercept via the fit_intercept hyperparameter (True by default):
model = LinearRegression(fit_intercept=True)
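As a quick sanity check, get_params() returns every hyperparameter the instance was configured with:

```python
from sklearn.linear_model import LinearRegression

model = LinearRegression(fit_intercept=True)

# get_params() returns all hyperparameters of the estimator as a dict
params = model.get_params()
print(params["fit_intercept"])  # True
```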
Step 3: Arrange data into a features matrix and target vector, split data
- See also: Partitioning the Dataset
Recall: the data representation in sklearn requires a two-dimensional features matrix and a one-dimensional target array.
# split the dataset into features and the target variable
feature_cols = ["LIST OF COLNAMES"]
X = df[feature_cols] # Features
y = df["target_var"] # Target variable
Once we have arranged the data into a feature matrix and target vector, we can partition the data into a training and test set:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
Note: the stratify=y option preserves class proportions in both splits, but it only applies to classification targets (e.g. when classes are imbalanced); it cannot be used with a continuous regression target.
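A self-contained sketch of the split, with a hypothetical array standing in for df[feature_cols] and df["target_var"]:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 samples with 3 features
X = np.arange(300).reshape(100, 3).astype(float)
y = np.arange(100).astype(float)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# test_size=0.3 puts 30% of the rows in the test set
print(X_train.shape, X_test.shape)  # (70, 3) (30, 3)
```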
Step 4: Fit Model to the Data
We can then train the model on the training set using the .fit() method:
model.fit(X_train, y_train)
This computes the optimal values of the weights 𝑏₀ and 𝑏₁ from the training inputs and outputs, X_train and y_train, passed as the arguments.
.fit()
fit loads the data into the model (estimator) and performs complex calculations behind the scenes that 'learn' from the data to train the model.
Results of these computations are stored in model-specific attributes that the user can explore:
# to get regression coefficients
model.coef_
# to get regression intercept (predicted value of the dependent variable when all independent variables are 0)
model.intercept_
Note: By convention, all model parameters learned during the fit() process have trailing underscores (e.g. coef_, intercept_).
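These attributes can be checked on synthetic data. In this sketch the data are generated noise-free from y = 4 + 2.5x, so the learned intercept_ and coef_ should recover the true values:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Noise-free data from y = 4 + 2.5x
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = 4 + 2.5 * X[:, 0]

model = LinearRegression(fit_intercept=True)
model.fit(X, y)

print(model.intercept_)  # ~4.0
print(model.coef_)       # ~[2.5]
```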
To obtain the coefficient of determination (R-squared) for model performance:
model.score(X, y)
When applying .score(), the arguments are again the feature matrix X and the response y, and the return value is 𝑅².
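To make the meaning of the score concrete, here is a sketch (on synthetic data) verifying that .score() matches the textbook definition 𝑅² = 1 − SS_res / SS_tot:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 1))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(scale=0.1, size=100)

model = LinearRegression()
model.fit(X, y)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
y_pred = model.predict(X)
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(np.isclose(model.score(X, y), r2_manual))  # True
```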
Step 5: Predictions
Once you have fitted the LinearRegression model using the fit() method, you can use the predict() method to generate predicted values for new data points:
y_pred = model.predict(X_test) # returns a 1-D array
The result is a NumPy array, which we can inspect with print().
Alternatively, we can compute the predicted output manually (this requires import numpy as np):
y_pred = model.intercept_ + np.dot(X_test, model.coef_)
The np.dot() function computes the dot product between the matrix X_test and the coefficient vector model.coef_, which yields a 1-D array of predicted values; adding the intercept term gives the final predictions. Note: X_test must be a matrix of shape (n_samples, n_features), where n_samples is the number of samples and n_features is the number of features.
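A self-contained sketch (on synthetic data) confirming that for LinearRegression the manual formula reproduces predict() exactly:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))              # shape (n_samples, n_features)
y = X @ np.array([1.0, -2.0, 0.5]) + 3.0  # known linear relationship

model = LinearRegression()
model.fit(X, y)

# For LinearRegression, predict() is exactly intercept_ + X · coef_
manual = model.intercept_ + np.dot(X, model.coef_)
auto = model.predict(X)
print(np.allclose(manual, auto))  # True
```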