Gradient Descent

#deep-learning #math #gradient-descent

Introduction

Imagine that we have a function $f(x)$ and we would like to find its minimum. What would you do?

Simple, right? We only need to solve the following equation: $f'(x) = 0$.
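
For a simple function this analytical approach works nicely. Here is a tiny sketch (the function $f(x) = (x - 3)^2 + 1$ is just an illustrative choice, not tied to any particular model):

```python
# Toy case where solving f'(x) = 0 by hand is easy:
# f(x) = (x - 3)^2 + 1  ->  f'(x) = 2(x - 3)  ->  f'(x) = 0 at x = 3.
def f(x):
    return (x - 3) ** 2 + 1

def f_prime(x):
    return 2 * (x - 3)

print(f_prime(3.0))  # 0.0 -> the derivative vanishes at x = 3
print(f(3.0))        # 1.0 -> the minimum value of f
```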

The thing is that finding a formula for $f'$ is not always easy: derivatives tend to be extremely complicated, especially in deep learning where we deal with very complex functions. So we need another method that can find the minimum of a function without requiring a closed-form expression for the derivative $f'$.

Let's build some intuition

Let's suppose that we have a function $f$ with the following graph:

Graph 1

Let's start from a random point $x_0$. The goal is to move this point closer and closer to the optimal point $x^*$ at which $f'(x^*) = 0$. So the problem can be divided into two parts:

  • In which direction should we move the point $x$? Left or right?
  • How much should we move it?

The direction

Let's build some intuition in order to answer the first question. Take a look at the following points:

Graph 2 Graph 3

Note that:

  • When the point $x_0$ is to the right of the optimal point $x^*$, its tangent line goes up.
  • When the point $x_0$ is to the left of the optimal point $x^*$, its tangent line goes down.

The direction of a line is determined by the sign of its slope:

  • A line goes up $\implies$ its slope $a$ is positive.
  • A line goes down $\implies$ its slope $a$ is negative.

Note that:
The slope of the tangent line of a function at a point $x_0$ is exactly the derivative at that point, $f'(x_0)$:

$$\text{tangent}(x_0): \quad g(x) = f'(x_0)\,(x - x_0) + f(x_0)$$

For example, for $f(x) = x^2$ at $x_0 = 2$, the tangent is $g(x) = 4(x - 2) + 4$: its slope $4 = f'(2)$ is positive, so it goes up.

So, as an answer to the question "In which direction should we move $x_0$?" (see the short sketch after this list):

  • $f'(x_0) > 0$ $\implies$ $x_0$ is to the right of $x^*$ $\implies$ we need to move $x_0$ to the left.
  • $f'(x_0) < 0$ $\implies$ $x_0$ is to the left of $x^*$ $\implies$ we need to move $x_0$ to the right.
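
To make this rule concrete, here is a minimal sketch in Python (the example function $f(x) = (x - 3)^2$, with minimum at $x^* = 3$, is my own choice for illustration):

```python
# Pick the direction of the move from the sign of the derivative at x0.
# Illustrative function: f(x) = (x - 3)**2, whose minimum is at x* = 3.
def f_prime(x):
    return 2 * (x - 3)

for x0 in (5.0, 1.0):  # one point to the right of x*, one to the left
    direction = "left" if f_prime(x0) > 0 else "right"
    print(f"x0 = {x0}: f'(x0) = {f_prime(x0)}, move {direction}")
# x0 = 5.0: f'(x0) = 4.0, move left
# x0 = 1.0: f'(x0) = -4.0, move right
```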

The steps

Now for the second question: how much should we move $x_0$? Take a look at the following examples:

Graph 4 Graph 5

We can conclude that (a quick numerical check follows this list):

  • $x_0$ is close to $x^*$ => the slope of the tangent is small => $|f'(x_0)|$ is small.
  • $x_0$ is far from $x^*$ => the slope of the tangent is large => $|f'(x_0)|$ is large.
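
As a quick numerical check (again with an illustrative function, here $f(x) = x^2$ whose minimum is at $x^* = 0$), the magnitude of the derivative shrinks as the point approaches the minimum:

```python
# f(x) = x**2 has its minimum at x* = 0 and derivative f'(x) = 2x.
# The farther x0 is from the minimum, the larger |f'(x0)| is.
for x0 in (4.0, 2.0, 1.0, 0.5, 0.1):
    print(f"x0 = {x0}: |f'(x0)| = {abs(2 * x0)}")
```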

By answering both questions, we conclude that the derivative at the point $x_0$ alone gives us a lot of insight about both the direction of the optimal point $x^*$ and how far away it is.

Gradient descent

Gradient descent is the formalization of the answers to the previous two questions. It's an iterative optimization algorithm that approximates the minimum $x^*$ of a function starting from a random initial point $x_0$. The algorithm is stated as follows:

$$x_{n+1} = x_{n} - lr \times \frac{\mathrm{d} f}{\mathrm{d} x_{n}}$$

where:

  • $\frac{\mathrm{d} f}{\mathrm{d} x_{n}}$ is simply the derivative of $f$ at the point $x_{n}$.
  • $lr$ (the learning rate) is a positive constant that determines how big the steps are going to be.

Notice that:

  • $x_n$ is to the right of $x^*$ => $\frac{\mathrm{d} f}{\mathrm{d} x_{n}} > 0$ => $x_{n+1} = x_{n} - \text{positive}$ => $x_n$ moves to the left.
  • $x_n$ is to the left of $x^*$ => $\frac{\mathrm{d} f}{\mathrm{d} x_{n}} < 0$ => $x_{n+1} = x_{n} + \text{positive}$ => $x_n$ moves to the right.
  • $x_n$ is close to $x^*$ => $\frac{\mathrm{d} f}{\mathrm{d} x_{n}}$ is close to $0$ => small update to $x_n$.
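
Putting everything together, here is a minimal sketch of the update rule in Python. The stopping rule (derivative close to zero), the learning rate, and the example function are illustrative assumptions, not prescriptions from the article:

```python
# Minimal gradient descent: repeatedly apply x_{n+1} = x_n - lr * f'(x_n).
def gradient_descent(f_prime, x0, lr=0.1, tol=1e-6, max_iters=10_000):
    x = x0
    for _ in range(max_iters):
        grad = f_prime(x)      # df/dx at the current point x_n
        if abs(grad) < tol:    # derivative close to 0 -> close to the minimum
            break
        x = x - lr * grad      # move left if grad > 0, right if grad < 0
    return x

# Example: f(x) = (x - 3)**2, so f'(x) = 2(x - 3) and the true minimum is x* = 3.
x_star = gradient_descent(lambda x: 2 * (x - 3), x0=10.0)
print(x_star)  # ~3.0
```

In this particular example each step multiplies the distance to the minimum by $1 - 2 \times lr = 0.8$, which is why the iterates converge to $3$.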

Quiz

  • When does gradient descent stop iterating?
    • When $x_n$ is small enough.
    • When $x_n$ is close to $x_0$.
    • When $\frac{\mathrm{d} f}{\mathrm{d} x_{n}} = 0$. (correct)
  • How do we choose $x_0$?
    • We pick it randomly. (correct)
    • We take it in the neighborhood of $x_n$.
    • It depends on the problem.
  • Why do we need gradient descent?
    • Because computers are not powerful enough to calculate derivatives.
    • Because it's extremely hard to find the derivative formulas of deep learning models. (correct)
    • Because functions have more than one local minimum.
