Linear Regression


Introduction

Given a dataset $D = \{(X_{1}, Y_{1}), \dots, (X_{N}, Y_{N})\}$ such that $X_{i}$ and $Y_{i}$ are continuous, the goal of "Linear Regression" is to find the line that best fits this data.

In other words, we want to create the model:

$$\hat{y} = a_{0} + a_{1}x_{1} + \dots + a_{p}x_{p}$$

where $p$ is the number of dimensions of the variable $X$.

In this article we will see how to solve this problem in three scenarios:

  • When $X$ is one-dimensional, i.e. $p=1$.

  • When $X$ is multi-dimensional, i.e. $p>1$.

  • Using gradient descent.

$X$ is one-dimensional (Ordinary Least Squares)

The model that we want to create has the form:

$$\hat{y} = a_{0} + a_{1}x$$

Remember that the goal of linear regression is to find the line that best fits the data. In other words, we need to minimize the sum of the squared vertical distances (the residuals) between the data points and the line:

$$
\begin{aligned}
(\hat{a}_{0}, \hat{a}_{1}) &= \underset{(a_{0}, a_{1})}{\operatorname{argmin}} \sum\limits_{i=1}^{N} (y_{i} - \hat{y}_{i})^2 \\
&= \underset{(a_{0}, a_{1})}{\operatorname{argmin}} \sum\limits_{i=1}^{N} (y_{i} - (a_{0} + a_{1}x_{i}))^2
\end{aligned}
$$

Let us define:

$$L = \sum\limits_{i=1}^{N} (y_{i} - (a_{0} + a_{1}x_{i}))^2$$
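To make the quantity $L$ concrete, here is a minimal Python sketch that evaluates it for a candidate line. The function name `squared_error_loss` and the toy data are illustrative assumptions, not part of the original article.

```python
def squared_error_loss(xs, ys, a0, a1):
    """Sum of squared residuals L for the line y_hat = a0 + a1 * x."""
    return sum((y - (a0 + a1 * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical toy data lying close to the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]

# A line close to the data gives a small loss ...
loss_good = squared_error_loss(xs, ys, a0=1.0, a1=2.0)
# ... while a poorly chosen line gives a much larger one.
loss_bad = squared_error_loss(xs, ys, a0=0.0, a1=1.0)

print(loss_good, loss_bad)
```

Minimizing $L$ means searching for the pair $(a_{0}, a_{1})$ that makes this number as small as possible.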

In order to find the minimum, we need to solve the following equations:

$$\begin{cases} \frac{\partial L}{\partial a_{0}} = 0\\ \frac{\partial L}{\partial a_{1}} = 0 \end{cases}$$

$$\begin{cases} \sum\limits_{i=1}^{N} -2(y_{i} - (a_{0} + a_{1}x_{i})) = 0\\ \sum\limits_{i=1}^{N} -2x_{i}(y_{i} - (a_{0} + a_{1}x_{i})) = 0 \end{cases}$$

We start by expanding the first equation (after dividing both sides by $-2$):

$$
\begin{aligned}
\sum\limits_{i=1}^{N} y_{i} - \sum\limits_{i=1}^{N} a_{0} - \sum\limits_{i=1}^{N} a_{1}x_{i} &= 0\\
\sum\limits_{i=1}^{N} y_{i} - Na_{0} - a_{1}\sum\limits_{i=1}^{N} x_{i} &= 0\\
a_{0} &= \frac{\sum\limits_{i=1}^{N} y_{i}}{N} - \frac{\sum\limits_{i=1}^{N} x_{i}}{N}a_{1}\\
a_{0} &= \bar{Y} - \bar{X}a_{1}
\end{aligned}
$$

where $\bar{X}$ and $\bar{Y}$ denote the sample means of the $x_{i}$ and the $y_{i}$.

We substitute into the second equation (again dividing by $-2$), with $\bar{X}$ and $\bar{Y}$ denoting the sample means of the $x_{i}$ and the $y_{i}$:

$$
\begin{aligned}
\sum\limits_{i=1}^{N} x_{i}(y_{i} - \bar{Y} + \bar{X}a_{1} - a_{1}x_{i}) &= 0\\
\sum\limits_{i=1}^{N} x_{i}(y_{i} - \bar{Y}) - a_{1}\sum\limits_{i=1}^{N} x_{i}(x_{i} - \bar{X}) &= 0\\
a_{1} &= \frac{\sum\limits_{i=1}^{N} x_{i}(y_{i} - \bar{Y})}{\sum\limits_{i=1}^{N} x_{i}(x_{i} - \bar{X})}
\end{aligned}
$$

Since $\sum_{i=1}^{N} (y_{i} - \bar{Y}) = 0$ and $\sum_{i=1}^{N} (x_{i} - \bar{X}) = 0$, subtracting $\bar{X}$ times these sums from the numerator and the denominator changes nothing, which gives:

$$a_{1} = \frac{\sum\limits_{i=1}^{N} (x_{i} - \bar{X})(y_{i} - \bar{Y})}{\sum\limits_{i=1}^{N} (x_{i} - \bar{X})^2} = \frac{COV(X, Y)}{VAR(X)}$$

We substitute back into $a_{0}$:

$$\begin{cases} a_{0} = \bar{Y} - \bar{X}\frac{COV(X, Y)}{VAR(X)}\\ a_{1} = \frac{COV(X, Y)}{VAR(X)} \end{cases}$$
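The closed-form solution above can be sketched in a few lines of plain Python. The function name `fit_simple_ols` and the toy data are assumptions made for illustration; the covariance and variance are both divided by $N$, and that factor cancels in the ratio.

```python
def fit_simple_ols(xs, ys):
    """Return (a0, a1) minimizing the sum of squared residuals for y_hat = a0 + a1 * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Covariance of X and Y, and variance of X (both with a 1/N factor).
    cov_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / n
    var_x = sum((x - x_mean) ** 2 for x in xs) / n
    if var_x == 0:
        raise ValueError("all x values are identical, so the slope is undefined")
    a1 = cov_xy / var_x
    a0 = y_mean - a1 * x_mean
    return a0, a1


# Hypothetical toy data lying exactly on the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

a0, a1 = fit_simple_ols(xs, ys)
print(a0, a1)
```

Because the toy points lie exactly on a line, the fit recovers its intercept and slope up to floating-point error.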

$X$ is multi-dimensional (Ordinary Least Squares)

In this case, $X_{i}$ is no longer a real number; instead, it is a vector of size $p$:

$$X_{i} = (X_{i1}, X_{i2}, \dots, X_{ip})$$

So, the model is written as follows:

$$\hat{y} = a_{0} + a_{1}x_{1} + a_{2}x_{2} + \dots + a_{p}x_{p}$$

or, it can be written in a matrix format:

$$\hat{Y} = XW$$

where:

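  • $Y$ is of shape $(N, 1)$.

  • $X$ is of shape $(N, p)$.

  • $W$ is of shape $(p, 1)$: this is the parameters vector $(w_{1}, w_{2}, \dots, w_{p})$.

To account for the intercept $a_{0}$ in this format, a constant column of ones can be included among the columns of $X$, so that the matching entry of $W$ plays the role of $a_{0}$.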

Similarly to the first case, we aim to minimize the following quantity:

$$\hat{W} = \underset{W}{\operatorname{argmin}} \sum\limits_{i=1}^{N} (y_{i} - \hat{y}_{i})^2$$

Again, let us define:

$$
\begin{aligned}
L &= \sum\limits_{i=1}^{N} (y_{i} - \hat{y}_{i})^2\\
&= (Y-XW)^{T}(Y-XW)\\
&= Y^TY-Y^TXW-W^TX^TY+W^TX^TXW\\
&= Y^TY-2W^TX^TY+W^TX^TXW
\end{aligned}
$$

The last step uses the fact that $Y^TXW$ is a scalar, so it equals its own transpose $W^TX^TY$.
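As a sanity check on this expansion, the following NumPy sketch compares the sum-of-squares form of $L$, the matrix form, and the expanded form on random data. The shapes, the seed, and the variable names are arbitrary choices made for illustration.

```python
import numpy as np

# Arbitrary random data, used only to compare the three expressions of L.
rng = np.random.default_rng(0)
N, p = 5, 3
X = rng.normal(size=(N, p))
Y = rng.normal(size=(N, 1))
W = rng.normal(size=(p, 1))

residual = Y - X @ W

# L as a sum of squared residuals.
loss_sum = float(np.sum(residual ** 2))
# L as (Y - XW)^T (Y - XW), which is a 1x1 matrix.
loss_matrix = float((residual.T @ residual)[0, 0])
# L as Y^T Y - 2 W^T X^T Y + W^T X^T X W.
loss_expanded = float((Y.T @ Y - 2 * W.T @ X.T @ Y + W.T @ X.T @ X @ W)[0, 0])

print(loss_sum, loss_matrix, loss_expanded)
```

The three values should agree up to floating-point rounding.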

Since we want to minimize $L$ with respect to $W$, we can ignore the first term $Y^TY$ because it is independent of $W$, and solve the following equation:

$$
\begin{aligned}
\frac{\partial (-2W^TX^TY+W^TX^TXW)}{\partial W} &= 0\\
-2X^TY+2X^TX\hat{W} &= 0\\
\hat{W} &= (X^TX)^{-1}X^TY
\end{aligned}
$$

The last step assumes that $X^TX$ is invertible.
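A minimal NumPy sketch of this formula follows. The function name `fit_ols` and the toy data are assumptions for illustration, and the intercept is handled by adding a column of ones to $X$, which is one common convention rather than something the derivation above requires. The code solves the linear system $X^TX\hat{W} = X^TY$ instead of forming the inverse explicitly, which gives the same result when $X^TX$ is invertible.

```python
import numpy as np


def fit_ols(X, Y):
    """Solve the normal equations X^T X W = X^T Y for W."""
    # np.linalg.solve avoids computing the explicit inverse (X^T X)^{-1}.
    return np.linalg.solve(X.T @ X, X.T @ Y)


# Hypothetical toy data generated from y = 1 + 2*x1 - 3*x2.
features = np.array([[0.0, 0.0],
                     [1.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 1.0],
                     [2.0, 1.0]])
# Assumed convention: a leading column of ones carries the intercept.
X = np.hstack([np.ones((features.shape[0], 1)), features])
Y = 1.0 + 2.0 * features[:, [0]] - 3.0 * features[:, [1]]

W_hat = fit_ols(X, Y)
print(W_hat.ravel())
```

When $X^TX$ is singular or badly conditioned, a least-squares routine such as `np.linalg.lstsq` is usually a safer choice than solving the normal equations directly.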

Using gradient descent

Here is the formulation of the gradient descent algorithm:

$$w_{n+1} = w_{n} - lr \times \frac{\partial f}{\partial w_{n}}$$

where $lr$ is the learning rate (the step size) and $f$ is the function to minimize.
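To illustrate the update rule on its own, here is a small sketch that applies it to the one-variable function $f(w) = (w - 3)^2$, whose minimum is at $w = 3$. The function, the learning rate, and the number of steps are arbitrary choices made for this example.

```python
def gradient_descent_1d(grad, w0, lr=0.1, n_steps=100):
    """Repeat the update w <- w - lr * grad(w) for a fixed number of steps."""
    w = w0
    for _ in range(n_steps):
        w = w - lr * grad(w)
    return w


# f(w) = (w - 3)^2 has derivative f'(w) = 2 * (w - 3).
w_min = gradient_descent_1d(grad=lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_min)
```

Each step moves $w$ against the sign of the derivative, so the iterates approach the minimizer as long as the learning rate is small enough.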

Now all we have to do is apply it to the two parameters $a_{0}$ and $a_{1}$ (in the case of a one-dimensional $X$):

$$\begin{cases} a_{0}^{(n+1)} = a_{0}^{(n)} - lr \times \frac{\partial L}{\partial a_{0}}\\ a_{1}^{(n+1)} = a_{1}^{(n)} - lr \times \frac{\partial L}{\partial a_{1}} \end{cases}$$

and we know that:

$$\begin{cases} \frac{\partial L}{\partial a_{0}} = \sum\limits_{i=1}^{N} -2(y_{i} - (a_{0} + a_{1}x_{i}))\\ \frac{\partial L}{\partial a_{1}} = \sum\limits_{i=1}^{N} -2x_{i}(y_{i} - (a_{0} + a_{1}x_{i})) \end{cases}$$

By substitution:

$$\begin{cases} a_{0}^{(n+1)} = a_{0}^{(n)} + 2 \times lr \times \sum\limits_{i=1}^{N} (y_{i} - (a_{0}^{(n)} + a_{1}^{(n)}x_{i}))\\ a_{1}^{(n+1)} = a_{1}^{(n)} + 2 \times lr \times \sum\limits_{i=1}^{N} x_{i}(y_{i} - (a_{0}^{(n)} + a_{1}^{(n)}x_{i})) \end{cases}$$
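These two update equations can be sketched in plain Python as follows. The function name `gradient_descent_fit`, the zero starting point, the learning rate, and the number of steps are all assumptions chosen so that the toy example converges; they are not values prescribed by the derivation.

```python
def gradient_descent_fit(xs, ys, lr=0.01, n_steps=5000):
    """Fit y_hat = a0 + a1 * x by gradient descent on the sum of squared residuals."""
    a0, a1 = 0.0, 0.0  # arbitrary starting point
    for _ in range(n_steps):
        # Residuals y_i - (a0 + a1 * x_i) at the current parameters.
        residuals = [y - (a0 + a1 * x) for x, y in zip(xs, ys)]
        # Partial derivatives of L with respect to a0 and a1.
        grad_a0 = -2.0 * sum(residuals)
        grad_a1 = -2.0 * sum(x * r for x, r in zip(xs, residuals))
        # Update both parameters at once from the same gradients.
        a0, a1 = a0 - lr * grad_a0, a1 - lr * grad_a1
    return a0, a1


# Hypothetical toy data lying exactly on the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

a0, a1 = gradient_descent_fit(xs, ys)
print(a0, a1)
```

Because $L$ is a sum over all data points rather than a mean, its gradient grows with $N$, so a learning rate that works on this small example may need to be reduced for larger datasets.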

Quiz

  • What is the formula of the optimal parameters vector in the case of multidimensional linear regression?

      • $\frac{COV(X, Y)}{VAR(Y)}$

      • $\frac{COV(X, Y)}{VAR(X)}$

      • $(X^TX)^{-1}X^TY$ "correct"

  • Why do we set the derivative to 0?

      • To find the extremum. "correct"

      • To minimize the derivative.

      • To only keep the real part of the derivative.

  • What is the objective of linear regression?

      • To find the line that passes through all the points.

      • To find the line that best describes the data. "correct"

      • To find the line that best separates the data.



Code Labs Academy © 2024 All rights reserved.