Introduction
Given a dataset $D = {(X_{1}, Y_{2}), \dots,(X_{N}, Y_{N})}$ such as $X_{i}$ and $Y_{i}$ are continuous, The goal of "Linear Regression" is to find the best line that fits this data.
In other words, we want to create the model:
$$ \hat{y} = a*{0} + a*{1}.x*{1} + \dots + a*{p}.x_{p} $$
where $p$ is the number of dimensions of the variable $X$.
In this article we will see how to solve this problem in three scenarios:

When X is one dimensional i.e. $p=1$.

When X is multidimensional i.e. $p>1$.

Using gradient descent.
$X$ is one dimensional (Ordinary Least Square)
The model that we want to create is of shape:
$$ \hat{y} = a*{0} + a*{1}.x $$
Remember that the goal of linear regression is to find the line that best fits the data. In other words, we need to minimize the distance between the data points and the line.
$$ (\hat{a*{0}}, \hat{a*{1}}) = \underset{(a*{0}, a*{1})}{\operatorname{argmin}} \sum\limits*{i=1}^{N} (y*{i}  \hat{y*{i}})^2 $$
$$ = \underset{(a*{0}, a*{1})}{\operatorname{argmin}} \sum\limits*{i=1}^{N} (y*{i}  (a*{0} + a*{1}.x*{i}))^2 $$
Let's put:
$$ L = \sum\limits*{i=1}^{N} (y*{i}  (a*{0} + a*{1}.x_{i}))^2 $$
In order to find the minimum, we need to solve the following equations:
$$ \begin{cases} \frac{\partial L}{\partial a_{0}} = 0\ \frac{\partial L}{\partial a_{1}} = 0 \end{cases} $$
$$ \begin{cases} \sum\limits_{i=1}^{N} 2(y_{i}  (a_{0} + a_{1}.x_{i})) = 0\ \sum\limits_{i=1}^{N} 2x_{i}(y_{i}  (a_{0} + a_{1}.x_{i})) = 0 \end{cases} $$
We start by developing the first equation:
$$ \sum\limits_{i=1}^{N} y_{i}  \sum\limits_{i=1}^{N}a_{0} + \sum\limits_{i=1}^{N} a_{1}.x_{i} = 0\ $$
$$ \sum\limits_{i=1}^{N} y_{i}  Na_{0} + \sum\limits_{i=1}^{N} a_{1}.x_{i} = 0\ $$
$$ a_{0} = \frac{\sum\limits_{i=1}^{N} y_{i}}{N}  \frac{\sum\limits_{i=1}^{N} x_{i}}{N}a_{1} $$
$$ a_{0} = Y  Xa_{1} $$
We substitute in the second equation:
$$ \sum\limits_{i=1}^{N} x_{i}(y_{i}  Y + Xa_{1}  a_{1}x_{i}) = 0 $$
$$ \sum\limits_{i=1}^{N} (y_{i}  Y) + a_{1}(X  x_{i}) = 0 $$
$$ \sum\limits_{i=1}^{N} (y_{i}  Y)  \sum\limits_{i=1}^{N}a_{1}(x_{i}  X) = 0 $$
$$ a_{1} = \frac{\sum\limits_{i=1}^{N} (y_{i}  Y)}{\sum\limits_{i=1}^{N}(x_{i}  X)} = \frac{\sum\limits_{i=1}^{N} (y_{i}  Y)(x_{i}  X)}{\sum\limits_{i=1}^{N}(x_{i}  X)^2} = \frac{COV(X, Y)}{VAR(X)} $$
We substitute back in $a_{0}$:
$$ \begin{cases} a_{0} = Y  X\frac{COV(X, Y)}{VAR(X)}\ a_{1} = \frac{COV(X, Y)}{VAR(X)} \end{cases} $$
$X$ is multidimensional (Ordinary Least Square)
In this case, $X_{i}$ is no longer a real number, but instead it's a vector of size $p$:
$$ X*{i} = (X*{i1},X*{i2},\dots,X*{ip}) $$
So, the model is written as follow:
$$ \hat{y} = a*{0} + a*{1}x*{1} + a*{2}x*{2} + \dots + a*{p}x_{p} $$
or, it can be written in a matrix format:
$$ \hat{Y} = X.W $$
where:

$Y$ is of shape $(N, 1)$.

$X$ is of shape $(N, p)$.

$W$ is of shape $(p, 1)$: this is the parameters vector $(w_{1}, w_{2}, \dots, w_{p})$.
Similarly to the first case, we aim to minimize the following quantity:
$$ \hat{W} = \underset{W}{\operatorname{argmin}} \sum\limits*{i=1}^{N} (y*{i}  \hat{y_{i}})^2 $$
Again let's put:
$$ L = \sum\limits*{i=1}^{N} (y*{i}  \hat{y_{i}})^2 $$
$$ = (YXW)^{T}(YXW) $$
$$ = Y^TYY^TXWW^TX^TY+W^TX^TXW $$
$$ = Y^TY2W^TX^TY+W^TX^TXW $$
Since we want to minimize $L$ with respect to $W$, then we can ignore the first term "$Y^TY$" because it's independent of $W$ and let's solve the following equation:
$$ \frac{\partial (2W^TX^TY+W^TX^TXW)}{\partial W} = 0 $$
$$ 2X^TY+2X^TX\hat{W} = 0 $$
$$ \hat{W} = (X^TX)^{1}X^TY $$
Using gradient descent
Here is the formulation of the gradient descent algorithm:
$$ w*{n+1} = w*{n}  lr \times \frac{\partial f}{\partial w_{n}} $$
Now all we have to do is to apply it on the two parameters $a_{0}$ and $a_{1}$ (in the case of a one variable $X$):
$$ \begin{cases} a_{0}^{(n+1)} = a_{0}^{(n)}  lr \times \frac{\partial L}{\partial a_{0}}\ a_{1}^{(n+1)} = a_{1}^{(n)}  lr \times \frac{\partial L}{\partial a_{1}} \end{cases} $$
and we know that:
$$ \begin{cases} \frac{\partial L}{\partial a_{0}} = \sum\limits_{i=1}^{N} 2(y_{i}  (a_{0} + a_{1}.x_{i}))\ \frac{\partial L}{\partial a_{1}} = \sum\limits_{i=1}^{N} 2x_{i}(y_{i}  (a_{0} + a_{1}.x_{i})) \end{cases} $$
By substitution:
$$ \begin{cases} a_{0}^{(n+1)} = a_{0}^{(n)} + 2 \times lr \times \sum\limits_{i=1}^{N} (y_{i}  (a_{0}^{(n)} + a_{1}^{(n)}.x_{i}))\ a_{1}^{(n+1)} = a_{1}^{(n)} + 2 \times lr \times \sum\limits_{i=1}^{N} x_{i}(y_{i}  (a_{0}^{(n)} + a_{1}^{(n)}.x_{i})) \end{cases} $$
Quiz

What is the formula of the optimal parameters vector in the case of multidimensional linear regression:

$\frac{COV(X, Y)}{VAR(Y)}$

$\frac{COV(X, Y)}{VAR(X)}$

$(X^TX)^{1}X^TY$ "correct"

Why do we put the derivative to 0?

To find the extremum. "correct"

To minimize the derivative.

To only keep the real part of the derivative.

What is the objective of linear regression ?

To find the line that passes by all the points.

To find the line that best describes the data."correct"

To find the line that best separates the data.