Imagine that we have a function and we would like to find its minimum. What would you do ?
Simple right ? We only need to solve the following equation:
The thing is that finding the formula of is not always easy as they tend to be extremely complicated especially in deep learning where we deal with complex functions. So we need to find another method that can provide us the minimum of a function without the need of finding the formula of the derivative .
Let's build some intuition
Let's suppose that we have a function f with the corresponding graph:
Let's start with a random point . The goal is to move this point and make it closer and closer to such that x*. So the problem can be divided in two parts:
- In which direction should we move the point ? Left or Right ?
- How much should we move it ?
Let’s build some intuition in order to answer the first question. Take a look at the following point:
- when the point is to the right of the optimal point its tangent line goes up.
- when the point is to the right of the optimal point its tangent line goes down.
The direction of a line is determined by the sign of its slope:
- A line goes up the slope is positive.
- A line goes down the slope is negative.
The slope of the tangent line of a function in a certain point is no more than the derivative in that point :
So as an answer to the question "Where should we move ?":
- to the right of We need to move to the left.
- to the left of We need to move to the right.
Now for the second question, How much should we move ? Take a look at the following examples:
We can conclude that:
- is close to => The slope of the tangent is small => is small.
- is distant from => The slope of the tangent is big => is big.
By answering both questions, we concluded that only the knowledge of the derivative in the point can give us a lot of insight about the direction and the distance of the optimal point .
Gradient descent is the formulation of the answers of the previous two questions. It's an optimization iterative algorithm that approximates the minimum of function starting from a random initial point . The algorithm is stated as follow:
- is no more than the derivative of in the point .
- is a positive constant that determines how big the steps are going to be.
- is to the right of => => => moves to the left.
- is to the left of => => => moves to the right.
- close to => close to => Small update to .
- When does gradient descent stop iterating:
- When is small enough.
- When is close to .
- When . XXX
- How do we choose :
- We pick it randomly. XXX
- We take it in the neighborhood of .
- It depends on the problem.
- Why do we need gradient descent:
- Because computers are not powerful enough to calculate derivatives.
- Because it's extremely hard to find the derivatives formulas of deep learning models. XXX
- Because functions have more than one local minimum.