Linear Regression Demystified: From Concepts to Optimization

Hey guys! I'm Dhyuthidhar, and if you're new here, welcome to my blog! I love writing about all things Computer Science, especially machine learning. In this post I'll walk you through a supervised model: Linear Regression. I'll explain the topic in detail and keep it simple enough that you can follow along.

Buckle up, Guys!

What is Linear Regression?

  • Linear regression is a supervised learning model, meaning it is trained on labeled data: every example comes with input features (predictors) and a target value.

  • It is a widely used model for regression tasks.

  • Regression means predicting a numerical value instead of a class, as in classification.

  • It models the relationship between a dependent variable and one or more independent variables using a linear function.

Problem Statement

  • In this model, the input is a training set of labeled pairs {(xi, yi)}, i = 1 to N, where:

    • N: Total number of training examples.

    • xi: Feature vector (input) of D dimensions for each example.

    • yi: Target variable (output) for each example.

  • The linear regression model f_w,b(x) is a linear combination of the features of the example x:

$$f_{\mathbf{w}, b}(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b$$

Where:

  • w^T is the transpose of the weight vector w.

  • x is the feature vector.

  • w^Tx is the dot product of w and x.

  • The linear regression function is parameterized by w,b.

For every D-dimensional input vector xi in the data, the model predicts the corresponding target yi:

$$f_{\mathbf{w}, b}(\mathbf{x}_i) = y_i$$

  • The output of the model depends on the parameters w and b. Two models with different parameter values will generally produce different predictions for the same example.

  • So we try different values of w and b and look for the optimal values (w*, b*) that give the most accurate model (a quick code sketch of the prediction function follows this list).
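To make this concrete, here is a minimal NumPy sketch of the prediction function. The array shapes and the names (predict, X, w, b) are my own illustrative choices, not code from any particular library.

```python
import numpy as np

def predict(X, w, b):
    """Linear regression prediction f_{w,b}(x) = w^T x + b.

    X : (N, D) array of N examples with D features each
    w : (D,) weight vector
    b : scalar bias
    Returns an (N,) array of predicted target values.
    """
    return X @ w + b

# Tiny example with D = 2 features and made-up parameters
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
w = np.array([0.5, -1.0])
b = 2.0
print(predict(X, w, b))  # [ 0.5 -0.5]
```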

Key requirement: the hyperplane in linear regression is chosen to be as close to all of the training examples as possible.

  • Based on this requirement, the illustration shows a hyperplane (red line) that lies close to the data points (blue dots).

  • We can use that hyperplane (red line) to predict Ynew for a new, unlabelled example Xnew.

  • In other words, linear regression fits a hyperplane to the data, and that hyperplane should lie as near to the data points as possible.

  • With a single input feature the model is a one-dimensional regression line, with two features it is a two-dimensional plane, and with more than two features it is a higher-dimensional hyperplane.

  • In general, the regression model forms a hyperplane: for a dataset with D features, the fitted surface lives in D+1 dimensions (D feature axes plus the target axis).

  • If the regression line is far from the data points, the predictions for Ynew are likely to be inaccurate.

Solution

To satisfy this requirement, we use an optimization procedure that minimizes a cost function and finds the optimal values w* and b*. The cost function for linear regression is usually the Mean Squared Error (MSE).

$$\text{Cost}(w, b) = \frac{1}{N} \sum_{i=1}^{N} \left( f_{w,b}(x_i) - y_i \right)^2$$

  • In mathematics, the function which we minimize or maximize is called an objective function or simply an objective.

  • In the cost function above, (f_w,b(xi) - yi)^2 is the objective, or loss function.

  • The loss function is the penalty the model pays for a wrong prediction on example i.

The loss function is also referred to as squared error loss.

$$L(y, f_{\mathbf{w},\mathbf{b}}(x)) = (y - f_{\mathbf{w},\mathbf{b}}(x))^2$$

  • Averaging this loss over all training examples gives the cost function, the Mean Squared Error (MSE); a short code sketch of the loss and the cost follows below.

  • The cost function is also called the empirical risk or average loss.
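As a sketch of how these formulas translate into code (the function names and toy data are illustrative assumptions), the per-example squared error loss and the MSE cost could be computed like this:

```python
import numpy as np

def squared_error_loss(y_true, y_pred):
    """Per-example loss: L(y, f(x)) = (y - f(x))^2."""
    return (y_true - y_pred) ** 2

def mse_cost(X, y, w, b):
    """Cost(w, b): the squared error loss averaged over all N examples."""
    y_pred = X @ w + b
    return np.mean(squared_error_loss(y, y_pred))

# Toy data with one feature and arbitrary parameters
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
w = np.array([1.5])
b = 0.5
print(mse_cost(X, y, w, b))  # mean of the three squared errors, about 0.417
```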

Why is the loss function quadratic?

  • Why not the absolute difference?

  • We could just as well use the absolute difference, or even a cubic function, as the loss function.

  • When designing an ML model we make a lot of arbitrary decisions.

    • For the model development, we used a linear combination.

    • We could instead use squared features or some other polynomial features.

    • For the loss function, we used squared error.

    • For the loss we could instead use the absolute difference, the cube of the difference, or binary loss (1 when the prediction is wrong and 0 when it's correct); a small comparison of these penalties follows after this list.

    • Different algorithms can minimize these loss functions to find the best parameters for the model.

  • Each choice impacts how the model performs.
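To illustrate how the choice of loss changes the penalty, here is a small, hypothetical comparison of how the same prediction errors are weighted by a few of the losses mentioned above (the numbers are made up purely for illustration):

```python
import numpy as np

errors = np.array([0.1, 1.0, 3.0])  # prediction errors f(x) - y

squared = errors ** 2                # squared error: amplifies large errors
absolute = np.abs(errors)            # absolute error: penalizes errors linearly
cubed = np.abs(errors) ** 3          # cubed error: amplifies large errors even more

for e, s, a, c in zip(errors, squared, absolute, cubed):
    print(f"error={e:.1f}  squared={s:.2f}  absolute={a:.2f}  cubed={c:.2f}")
```

Notice how the error of 3.0 contributes 3.0 under the absolute loss, 9.0 under the squared loss, and 27.0 under the cubed loss: the higher the power, the harder the model is pushed to avoid big mistakes.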

Why do people invent new learning algorithms?

  • There are two main reasons:

    • The new model solves a specific practical problem better than existing models.

    • The new model has better theoretical guarantees on the quality of the model it produces.

  • We are using the linear form for this model.

    • Why? Because it is simple compared to others.

    • And it also rarely overfits.

  • Overfitting is the problem where a model learns the training data almost perfectly but performs much worse on new (test) data than it does on the examples it was trained on.

  • Below is a figure of a polynomial regression model that overfits:

  • It's like doing well on classroom exercises but not getting good marks in the final exam.

  • We can talk about this topic in a future blog.

  • These models are built on statistical methods, so what does "learning" actually mean here?

  • As you can see, the model is trained by adjusting the parameters w and b, which keep changing during training.

  • While training, the model uses a mathematical function known as the loss function, which measures the error of the model's predictions.

  • So now we know why we use the linear form for regression instead of, say, a degree-10 polynomial: the linear form rarely overfits.

  • Then what about the loss function: why do we use squared error?

Why do we use squared error for the loss function?

  • Why did we decide the loss function should be squared?

  • For the answer we need to dig into history; let's go back to the early 1800s.

  • In 1805, the French mathematician Adrien-Marie Legendre, who first published the method of least squares, found that squaring the error before summing it is a convenient way to gauge the quality of the model.

  • The absolute value, by contrast, doesn't have a continuous derivative (it is not differentiable at zero), which makes the function less smooth.

  • Smooth loss functions make it easier for linear models to have closed-form solutions (straightforward algebraic formulas) for the optimal parameters.

  • Closed-form solutions are preferable because they are simple and efficient. Numerical methods (like gradient descent used in training neural networks) are more complex and computationally intensive.

  • Using squared differences (errors) in the loss function amplifies larger errors more than smaller ones, making the model more sensitive to big mistakes.

  • But using the cube or the fourth power of the difference would make the derivative, and therefore the optimization, more cumbersome.

  • Calculating the derivative (or gradient) of the loss function helps us find the optimal parameters (w* and b*) that minimize the loss. By setting the gradient to zero, we can solve for these optimal values efficiently.

  • To find the global minimum, we can either find the critical points by setting the derivative to zero and solving directly, or use an optimization technique like gradient descent to navigate towards it (for linear regression's MSE, the critical point is the global minimum).

  • In practice, we often use gradient descent to minimize the cost function and find w* and b* (a small sketch of both routes follows below).
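Below is a minimal sketch of both routes on a one-feature problem. The toy data, learning rate, and iteration count are illustrative assumptions, not a definitive recipe; the closed-form answer is obtained here with NumPy's least-squares solver for comparison.

```python
import numpy as np

# Toy data generated from y = 2x + 1 plus a little noise (purely illustrative)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=50)
y = 2.0 * X + 1.0 + rng.normal(0, 0.5, size=50)

# Route 1: gradient descent on the MSE cost
w, b = 0.0, 0.0
learning_rate = 0.01
for _ in range(5000):
    y_pred = w * X + b
    error = y_pred - y
    grad_w = 2 * np.mean(error * X)  # d(MSE)/dw
    grad_b = 2 * np.mean(error)      # d(MSE)/db
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

# Route 2: closed-form least-squares solution for comparison
A = np.column_stack([X, np.ones_like(X)])
w_star, b_star = np.linalg.lstsq(A, y, rcond=None)[0]

print(f"gradient descent: w = {w:.3f}, b = {b:.3f}")
print(f"closed form:      w = {w_star:.3f}, b = {b_star:.3f}")
```

With a small enough learning rate and enough iterations, the two answers should agree closely, which is exactly the point: for linear regression with MSE, gradient descent and the closed-form solution converge to the same w* and b*.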

Conclusion

Linear regression finds the best-fit line (or hyperplane) that minimizes errors between predicted and actual values. The loss function measures the error for an individual example, while the cost function, often the Mean Squared Error (MSE), aggregates errors across the dataset. Gradient descent iteratively adjusts the parameters to minimize the cost function, converging towards the global minimum. Critical points, identified by setting the gradient to zero, guide this optimization. Various design choices, like feature selection and model type, impact performance. Squaring the errors simplifies the calculus and linear algebra involved. Overall, linear regression optimizes loss and cost functions to build effective predictive models.

So if you want to learn more, you can refer to machine learning and statistics books like The Hundred-Page Machine Learning Book, Practical Statistics, etc. Try researching the implementation yourself; I will talk about it in the next blog.