About Linear Regression

tags: #ML/supervised/regression

What is linear regression?

Linear Regression ~ Ordinary Least Squares

Linear regression is also known as Ordinary Least Squares Multiple Linear Regression

  • "Linear" - because the underlying relationship between the variables is assumed to be linear (this is also one of the assumptions for using linear regression), and the model is based on the linear equation $y = mx + b$

  • "Least squares" - technique used to find the line of best fit

  • "Ordinary" - the standard (unweighted) variant of the least squares method

  • "Multiple" - because the regression can include more than one independent variable

The linear regression model is a statistical (and ML) model that can be applied to problems where the target is a real-valued number.

Note: with regression models, you are predicting a real-valued number.

Assumptions Check: Linearity

We can check for the presence of linearity using a scatterplot of the independent variable against the dependent variable.
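A minimal sketch of this visual check, using synthetic data (the variable names, coefficients, and filename are illustrative, not from the note):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen (no display needed)
import matplotlib.pyplot as plt

# Generate data with a roughly linear relationship plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, 50)

# If the cloud of points hugs a straight line, the linearity
# assumption is plausible
fig, ax = plt.subplots()
ax.scatter(x, y)
ax.set_xlabel("x (independent variable)")
ax.set_ylabel("y (dependent variable)")
ax.set_title("Visual check for linearity")
fig.savefig("linearity_check.png")
```

A curved or funnel-shaped scatter would instead suggest a transformation or a different model.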

Understanding the "Line of Best Fit"

The goal of a regression model is to find a line of best fit that minimizes the difference between the predicted output ($\hat{y}$) and the observed real-valued target ($y$) across all data points.

The equation of the line of best fit can be represented as:

$$y = mx + b$$
or...
$$y = b_1x + b_0$$

Approximating the Line of Best Fit: OLS

We can approximate the line of best fit using the ordinary least squares method:

Definition: Best fitting line

In the context of OLS, the best fitting line is the line where the residual sum of squares is at a minimum. This is our "loss function" for linear regression.

Therefore, the line with the smallest possible $SS_{\text{residuals}}$ is the one that produces the "best fitting line" (an $SS_{\text{residuals}}$ of 0 would mean a perfect fit, which rarely happens with real data).

The least-squares method is the technique used to find the line of best fit by minimizing the sum of the squared residuals of each data point from the regression line, such that:

$$SS_{\text{residuals}} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

The residual (AKA prediction error) is the difference between the observed Y and the predicted Y:

$$e = y - \hat{y}$$
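The two formulas above can be computed directly. A minimal sketch with toy values (the data and candidate coefficients are made up for illustration):

```python
import numpy as np

# Toy observed data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

# A candidate line: y_hat = b1*x + b0
b1, b0 = 2.0, 0.1
y_hat = b1 * x + b0

# Residuals: e_i = y_i - y_hat_i
residuals = y - y_hat

# Residual sum of squares: SS_residuals = sum((y_i - y_hat_i)^2)
ss_residuals = np.sum(residuals ** 2)
print(ss_residuals)  # → 0.18
```

Trying different values of `b1` and `b0` changes `ss_residuals`; OLS picks the pair that makes it smallest.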

The goal of linear regression is to find a line that minimizes the residual error for each observation by adjusting the parameters of the linear equation $y = a + bx$ so that the sum of the squared residuals is as small as possible:

Optimization Algorithm: Gradient Descent

This is achieved through an optimization algorithm called "gradient descent," which finds the best set of parameters (w, b) that minimizes the loss function by locating the global minimum of the function's performance surface.

(Figure: performance surface of the loss function)

The lowest point on the performance surface corresponds to the optimal set of parameters where the loss function is minimized and the model fit is best.

Caveat! Gradient descent works well when the function is convex; otherwise, the algorithm can get stuck at sub-optimal solutions ("saddle points" or "local minima"). This is not a concern for linear regression: its loss function is convex, so it has a single minimum point, which is the global minimum.
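A minimal sketch of gradient descent on the squared-error loss, using synthetic data (the learning rate, iteration count, and true coefficients are illustrative assumptions):

```python
import numpy as np

# Synthetic data: true slope w = 3, true intercept b = 2, plus noise
rng = np.random.default_rng(1)
x = rng.uniform(0, 5, 100)
y = 3.0 * x + 2.0 + rng.normal(0, 0.5, 100)

w, b = 0.0, 0.0   # initial parameters
lr = 0.01         # learning rate (step size)
n = len(x)

for _ in range(5000):
    y_hat = w * x + b
    # Gradients of the mean squared error with respect to w and b
    grad_w = (-2.0 / n) * np.sum(x * (y - y_hat))
    grad_b = (-2.0 / n) * np.sum(y - y_hat)
    # Step downhill on the performance surface
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)  # approaches the true slope and intercept
```

Because the loss surface is convex, the iterates converge to the same solution OLS gives in closed form; in practice libraries use the closed-form (or similar direct) solvers for plain linear regression.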

The line that produces the smallest sum of squares is the best fitting line

Illustration: (figure of the squared residuals around the regression line, omitted)

Dummy Variables

For nominal variables that convey only classification information, we can create a dummy variable for each class of the categorical variable; this allows multiple comparisons to be made for each subgroup within a single regression model.

Each dummy variable is coded as 0 or 1 depending on whether that individual is in that category.

A regression model can include many dummy variables as independent variables.

When dummy variables are used as dependent variables, we have binary or multi-class classification.
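A minimal sketch of dummy coding with pandas (the column names and values are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", "green", "blue"],
    "y": [1.0, 2.0, 3.0, 2.5],
})

# One 0/1 column per category; drop_first=True keeps one category
# ("blue" here) as the baseline, avoiding perfect collinearity with
# the intercept (the "dummy variable trap")
dummies = pd.get_dummies(df["color"], prefix="color", drop_first=True)
print(dummies.columns.tolist())  # → ['color_green', 'color_red']
```

Each remaining dummy's coefficient is then interpreted as the difference between that subgroup and the baseline category.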


Additional Resources