## Introduction

Given a dataset $D = {(X_{1}, Y_{2}), \dots,(X_{N}, Y_{N})}$ such as $X_{i}$ and $Y_{i}$ are continuous, The goal of "Linear Regression" is to find the best line that fits this data.

In other words, we want to create the model:

$$ \hat{y} = a*{0} + a*{1}.x*{1} + \dots + a*{p}.x_{p} $$

where $p$ is the number of dimensions of the variable $X$.

In this article we will see how to solve this problem in three scenarios:

- When X is one dimensional i.e. $p=1$.
- When X is multi-dimensional i.e. $p>1$.
- Using gradient descent.

## $X$ is one dimensional (Ordinary Least Square)

The model that we want to create is of shape: $$ \hat{y} = a*{0} + a*{1}.x $$

Remember that the goal of linear regression is to find the line that best fits the data. In other words, we need to minimize the distance between the data points and the line.

$$ (\hat{a*{0}}, \hat{a*{1}}) = \underset{(a*{0}, a*{1})}{\operatorname{argmin}} \sum\limits*{i=1}^{N} (y*{i} - \hat{y*{i}})^2 $$ $$ = \underset{(a*{0}, a*{1})}{\operatorname{argmin}} \sum\limits*{i=1}^{N} (y*{i} - (a*{0} + a*{1}.x*{i}))^2 $$

Let's put: $$ L = \sum\limits*{i=1}^{N} (y*{i} - (a*{0} + a*{1}.x_{i}))^2 $$

In order to find the minimum, we need to solve the following equations:

$$ \begin{cases} \frac{\partial L}{\partial a_{0}} = 0\ \frac{\partial L}{\partial a_{1}} = 0 \end{cases} $$

$$ \begin{cases} \sum\limits_{i=1}^{N} -2(y_{i} - (a_{0} + a_{1}.x_{i})) = 0\ \sum\limits_{i=1}^{N} -2x_{i}(y_{i} - (a_{0} + a_{1}.x_{i})) = 0 \end{cases} $$

We start by developing the first equation:

$$ \sum\limits_{i=1}^{N} y_{i} - \sum\limits_{i=1}^{N}a_{0} + \sum\limits_{i=1}^{N} a_{1}.x_{i} = 0\ $$

$$ \sum\limits_{i=1}^{N} y_{i} - Na_{0} + \sum\limits_{i=1}^{N} a_{1}.x_{i} = 0\ $$

$$ a_{0} = \frac{\sum\limits_{i=1}^{N} y_{i}}{N} - \frac{\sum\limits_{i=1}^{N} x_{i}}{N}a_{1} $$

$$ a_{0} = Y - Xa_{1} $$

We substitute in the second equation:

$$ \sum\limits_{i=1}^{N} x_{i}(y_{i} - Y + Xa_{1} - a_{1}x_{i}) = 0 $$

$$ \sum\limits_{i=1}^{N} (y_{i} - Y) + a_{1}(X - x_{i}) = 0 $$

$$ \sum\limits_{i=1}^{N} (y_{i} - Y) - \sum\limits_{i=1}^{N}a_{1}(x_{i} - X) = 0 $$

$$ a_{1} = \frac{\sum\limits_{i=1}^{N} (y_{i} - Y)}{\sum\limits_{i=1}^{N}(x_{i} - X)} = \frac{\sum\limits_{i=1}^{N} (y_{i} - Y)(x_{i} - X)}{\sum\limits_{i=1}^{N}(x_{i} - X)^2} = \frac{COV(X, Y)}{VAR(X)} $$

We substitute back in $a_{0}$:

$$ \begin{cases} a_{0} = Y - X\frac{COV(X, Y)}{VAR(X)}\ a_{1} = \frac{COV(X, Y)}{VAR(X)} \end{cases} $$

## $X$ is multi-dimensional (Ordinary Least Square)

In this case, $X_{i}$ is no longer a real number, but instead it's a **vector** of size $p$:
$$ X*{i} = (X*{i1},X*{i2},\dots,X*{ip}) $$

So, the model is written as follow: $$ \hat{y} = a*{0} + a*{1}x*{1} + a*{2}x*{2} + \dots + a*{p}x_{p} $$

or, it can be written in a matrix format: $$ \hat{Y} = X.W $$

where:

- $Y$ is of shape $(N, 1)$.
- $X$ is of shape $(N, p)$.
- $W$ is of shape $(p, 1)$: this is the parameters vector $(w_{1}, w_{2}, \dots, w_{p})$.

Similarly to the first case, we aim to minimize the following quantity:

$$ \hat{W} = \underset{W}{\operatorname{argmin}} \sum\limits*{i=1}^{N} (y*{i} - \hat{y_{i}})^2 $$

Again let's put: $$ L = \sum\limits*{i=1}^{N} (y*{i} - \hat{y_{i}})^2 $$

$$ = (Y-XW)^{T}(Y-XW) $$

$$ = Y^TY-Y^TXW-W^TX^TY+W^TX^TXW $$

$$ = Y^TY-2W^TX^TY+W^TX^TXW $$

Since we want to minimize $L$ with respect to $W$, then we can ignore the first term "$Y^TY$" because it's independent of $W$ and let's solve the following equation:

$$ \frac{\partial (-2W^TX^TY+W^TX^TXW)}{\partial W} = 0 $$

$$ -2X^TY+2X^TX\hat{W} = 0 $$

$$ \hat{W} = (X^TX)^{-1}X^TY $$

## Using gradient descent

Here is the formulation of the gradient descent algorithm: $$ w*{n+1} = w*{n} - lr \times \frac{\partial f}{\partial w_{n}} $$

Now all we have to do is to apply it on the two parameters $a_{0}$ and $a_{1}$ (in the case of a one variable $X$):

$$ \begin{cases} a_{0}^{(n+1)} = a_{0}^{(n)} - lr \times \frac{\partial L}{\partial a_{0}}\ a_{1}^{(n+1)} = a_{1}^{(n)} - lr \times \frac{\partial L}{\partial a_{1}} \end{cases} $$

and we know that:

$$ \begin{cases} \frac{\partial L}{\partial a_{0}} = \sum\limits_{i=1}^{N} -2(y_{i} - (a_{0} + a_{1}.x_{i}))\ \frac{\partial L}{\partial a_{1}} = \sum\limits_{i=1}^{N} -2x_{i}(y_{i} - (a_{0} + a_{1}.x_{i})) \end{cases} $$

By substitution:

$$ \begin{cases} a_{0}^{(n+1)} = a_{0}^{(n)} + 2 \times lr \times \sum\limits_{i=1}^{N} (y_{i} - (a_{0}^{(n)} + a_{1}^{(n)}.x_{i}))\ a_{1}^{(n+1)} = a_{1}^{(n)} + 2 \times lr \times \sum\limits_{i=1}^{N} x_{i}(y_{i} - (a_{0}^{(n)} + a_{1}^{(n)}.x_{i})) \end{cases} $$

## Quiz

- What is the formula of the optimal parameters vector in the case of multidimensional linear regression:
- $\frac{COV(X, Y)}{VAR(Y)}$
- $\frac{COV(X, Y)}{VAR(X)}$
- $(X^TX)^{-1}X^TY$
**"correct"**

- Why do we put the derivative to 0?
- To find the extremum.
**"correct"** - To minimize the derivative.
- To only keep the real part of the derivative.

- To find the extremum.
- What is the objective of linear regression ?
- To find the line that passes by all the points.
- To find the line that best describes the data.
**"correct"** - To find the line that best separates the data.