Gradient Descent Explained for Machine Learning Beginners
Updated on December 10, 2025 9 minutes read
In 2026, gradient descent is still one of the core algorithms used to train machine learning and deep learning models. It is the workhorse that helps us minimise loss functions when an exact analytical solution is hard or impossible to obtain.
Imagine we have a function $f(x)$, and we want to find a value of $x$ that minimises it. In simple cases, we can take the derivative, set $f'(x) = 0$, and solve for $x$. In many realistic models, especially deep neural networks, computing and solving those equations exactly is not feasible.
We therefore look for an iterative method that can walk us down the curve of the function until we get close to a minimum. Gradient descent is exactly that: a simple rule that repeatedly updates our current guess using information from the slope of the function.
Why not just solve the derivative?
For small textbook functions, we can usually compute derivatives symbolically and solve $f'(x) = 0$. Real-world machine learning models, however, can have millions of parameters and highly non-linear structures.
Even if automatic differentiation can compute derivatives numerically, solving the equation in closed form is still intractable. Instead, we use gradient descent to move step by step in the direction that most reduces the function value.
This idea generalises nicely. Once we understand it in one dimension, the same logic extends to many dimensions, where the gradient becomes a vector of partial derivatives.
Building intuition in one dimension
To build intuition, start with a one-dimensional function $f(x)$ and its graph. Suppose we pick a starting point $x_0$ somewhere on this curve.
Our goal is to move closer and closer to the point $x^*$ where the function reaches a local minimum. At that point, the slope is zero, so $f'(x^*) = 0$.
Two questions naturally arise when we try to move toward $x^*$:
- In which direction should we move the current point $x_0$? Left or right?
- How large should each step be so that we move efficiently without overshooting?
Direction: Should we move left or right?
The sign of the derivative tells us in which direction the function is increasing. The derivative at a point is the slope of the tangent line to the curve at that point.
If the slope (derivative) is positive, the tangent line goes up as we move to the right.
If the slope (derivative) is negative, the tangent line goes down as we move to the right.
For a typical U-shaped function with a minimum at $x^*$:
When $x$ is to the right of $x^*$, the slope is positive, so $f'(x) > 0$.
When $x$ is to the left of $x^*$, the slope is negative, so $f'(x) < 0$.
This means:
If $f'(x) > 0$, we should move to the left to go downhill.
If $f'(x) < 0$, we should move to the right to go downhill.
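The direction rule above can be sketched in a few lines of Python. The function `f`, its hand-derived derivative `f_prime`, and the helper `downhill_direction` are illustrative assumptions, not part of any library.

```python
def f(x):
    # Illustrative U-shaped function with its minimum at x = 2 (an assumption)
    return (x - 2) ** 2


def f_prime(x):
    # Derivative of f, worked out by hand
    return 2 * (x - 2)


def downhill_direction(x):
    """Return which way to move from x to decrease f, using only the sign of f'(x)."""
    slope = f_prime(x)
    if slope > 0:
        return "left"
    if slope < 0:
        return "right"
    return "stay"  # slope is exactly zero: we are at a stationary point


for x in [5.0, 2.0, -1.0]:
    print(f"x = {x:5.1f}  slope = {f_prime(x):5.1f}  move: {downhill_direction(x)}")
```

Note that the helper never looks at the value of `f` itself; the sign of the slope is enough to pick a direction.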
The key insight is that the sign of the derivative alone already tells us which direction to move to reduce the function value.
The tangent line and local slope
The tangent line at a point $x_0$ is a linear approximation of the function near that point. Its equation is
$$y = f(x_0) + f'(x_0)(x - x_0).$$
The coefficient $f'(x_0)$ is exactly the slope of the tangent. This slope is what gradient descent uses to decide how to update our current guess.
Knowing the derivative at a point, $f'(x_0)$, therefore gives us local information about both the direction and, roughly, how far the minimum might be.
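To see why the tangent line is only local information, here is a small sketch that compares a function with its tangent line, computed with the standard formula f(x0) + f'(x0)(x - x0). The choice of f(x) = x**2 and the point x0 = 1 are assumptions made purely for illustration.

```python
def f(x):
    # Illustrative function (an assumption)
    return x ** 2


def f_prime(x):
    # Derivative of f, worked out by hand
    return 2 * x


def tangent(x, x0):
    """Value at x of the tangent line to f that touches the curve at x0."""
    return f(x0) + f_prime(x0) * (x - x0)


x0 = 1.0
for x in [1.0, 1.1, 1.5, 3.0]:
    error = abs(f(x) - tangent(x, x0))
    print(f"x = {x:.1f}  f(x) = {f(x):5.2f}  tangent = {tangent(x, x0):5.2f}  error = {error:.2f}")
```

The approximation error grows as we move away from the point of tangency, which is one reason gradient descent takes many small steps and recomputes the slope each time.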
Step size: how far should we move?
The magnitude of the derivative reflects how steep the curve is at $x$. The steeper the curve, the larger the absolute value of the slope.
If $x$ is close to the minimum $x^*$, the slope is small, so $|f'(x)|$ is small.
If $x$ is far from $x^*$, the slope is larger, so $|f'(x)|$ tends to be larger.
Intuitively, for a U-shaped function like this one, a large slope suggests that we are far from the minimum, so we can afford to take a bigger step. As we get closer, we want smaller steps to avoid bouncing around the minimum.
Gradient descent formalises this idea by combining the derivative with a parameter called the learning rate.
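As a rough illustration of how the slope can set the step length, the sketch below scales the derivative by a fixed learning rate and prints the resulting step at several distances from the minimum. The function, the value `lr = 0.1`, and the helper `step_length` are assumptions chosen for this example.

```python
def f_prime(x):
    # Derivative of the illustrative function f(x) = (x - 2)**2 (an assumption)
    return 2 * (x - 2)


lr = 0.1  # learning rate, chosen arbitrarily for illustration


def step_length(x):
    """Size of the move that a slope-scaled step would make from x."""
    return abs(lr * f_prime(x))


for x in [10.0, 4.0, 2.5, 2.01]:
    print(f"x = {x:5.2f}  |slope| = {abs(f_prime(x)):6.2f}  step = {step_length(x):.3f}")
```

For this function, the step shrinks automatically as the point approaches x = 2, without us having to change the learning rate by hand.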
The gradient descent update rule
One-dimensional case
Gradient descent is an iterative optimisation algorithm that updates our current estimate step by step. In one dimension, the update rule is
$$x_{n+1} = x_n - \alpha f'(x_n).$$
Here:
$x_n$ is our current point at iteration $n$.
$f'(x_n)$ is the derivative of $f$ evaluated at $x_n$.
$\alpha$ is the learning rate, a positive number that controls how big each step is.
Notice what this rule does at each iteration:
If $f'(x_n) > 0$ (we are to the right of a local minimum for a U-shaped function), we subtract a positive quantity and move to the left.
If $f'(x_n) < 0$, the product $\alpha f'(x_n)$ is negative, so we subtract a negative number and move to the right.
As we get closer to a minimum, $f'(x_n)$ approaches zero, so the updates become smaller and smaller.
This simple rule gives us a powerful way to approximate a minimum without ever having to solve $f'(x) = 0$ explicitly.
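To make the behaviour of the update rule concrete, here is a short sketch that applies it a few times and records every iterate, once starting to the right of the minimum and once to the left. The function f(x) = (x - 2)**2, the learning rate 0.25, and the helper `trace` are assumptions for illustration.

```python
def f_prime(x):
    # Derivative of the illustrative function f(x) = (x - 2)**2 (an assumption)
    return 2 * (x - 2)


def trace(x0, lr, steps):
    """Apply x <- x - lr * f'(x) repeatedly and return every iterate."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] - lr * f_prime(xs[-1]))
    return xs


from_right = trace(x0=5.0, lr=0.25, steps=5)
from_left = trace(x0=-1.0, lr=0.25, steps=5)
print(from_right)  # [5.0, 3.5, 2.75, 2.375, 2.1875, 2.09375]
print(from_left)   # [-1.0, 0.5, 1.25, 1.625, 1.8125, 1.90625]
```

With these particular values, each step halves the distance to the minimum at x = 2, and both runs move toward it from their own side.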
From one dimension to many
In higher dimensions, we move from a scalar $x$ to a parameter vector $\theta$. The derivative generalises to the gradient, written $\nabla f(\theta)$, which is a vector of partial derivatives.
The update rule becomes
$$\theta_{n+1} = \theta_n - \alpha \nabla f(\theta_n).$$
The same intuition applies in many dimensions:
The gradient points in the direction of the steepest increase of the function. The negative gradient points in the direction of steepest decrease, which is where we want to go.
The same basic rule, usually in a mini-batch or adaptive variant, is how modern deep learning models update millions of parameters during training.
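A minimal sketch of the multi-dimensional case, written in plain Python so that it runs without extra libraries. The two-parameter function, its hand-derived gradient `grad_f`, and the default settings are all assumptions made for illustration; real projects would typically use an array library and automatic differentiation.

```python
def grad_f(theta):
    # Gradient of the illustrative function
    # f(t1, t2) = (t1 - 1)**2 + 2 * (t2 + 2)**2, whose minimum is at (1, -2)
    t1, t2 = theta
    return [2 * (t1 - 1), 4 * (t2 + 2)]


def gradient_descent(theta0, lr=0.1, iters=200):
    """Repeatedly step against the gradient, starting from theta0."""
    theta = list(theta0)
    for _ in range(iters):
        g = grad_f(theta)
        # Move every coordinate against its own partial derivative
        theta = [t - lr * gi for t, gi in zip(theta, g)]
    return theta


theta = gradient_descent([4.0, 3.0])
print(theta)  # close to [1.0, -2.0]
```

The loop has the same shape as in one dimension; only the scalar subtraction has become a coordinate-wise one.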
Choosing a learning rate
The learning rate $\alpha$ is one of the most important hyperparameters in gradient descent. It controls how aggressive or cautious each update is.
If $\alpha$ is too large, the algorithm can overshoot the minimum and diverge.
If $\alpha$ is too small, gradient descent will move very slowly and may take a long time to converge.
In practice, we often start with a moderate learning rate and adjust it based on how the training behaves. Many optimisers in deep learning libraries automatically adapt the effective step size during training.
A common workflow is to try a few candidate learning rates, watch how the loss curve evolves, and then refine the choice based on empirical behaviour.
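The workflow of trying a few candidate learning rates can be sketched as follows. The function f(x) = x**2, the starting point, the step count, and the three candidate values are assumptions picked to show the three typical behaviours.

```python
def f_prime(x):
    # Derivative of the illustrative function f(x) = x**2 (an assumption)
    return 2 * x


def run(lr, x0=1.0, steps=20):
    """Run a fixed number of gradient descent steps and return the final x."""
    x = x0
    for _ in range(steps):
        x = x - lr * f_prime(x)
    return x


for lr in [0.001, 0.1, 1.1]:
    print(f"lr = {lr:<5}  x after 20 steps = {run(lr):.4g}")
```

In this example, the smallest rate barely moves away from the starting point, the middle one gets close to the minimum at 0, and the largest one overshoots further on every step and moves away from the minimum.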
When does gradient descent stop?
In theory, gradient descent could continue iterating forever, updating $x_n$ again and again. In practice, we need a stopping rule that tells us when we are close enough to a minimum.
Common stopping criteria include:
The derivative or gradient becomes very small, for example $|f'(x_n)| < \varepsilon$ for a small tolerance $\varepsilon$.
The change in the parameters is very small, such as $|x_{n+1} - x_n|$ falling below a chosen threshold.
The change in the function value from one step to the next is tiny.
A maximum number of iterations or passes through the data (epochs) is reached.
In other words, gradient descent stops when further updates are unlikely to improve the solution significantly, or when we hit a time or compute budget.
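The stopping criteria listed above can be combined in a single loop. The sketch below is one possible arrangement, not a standard API: the function, the tolerance names (`grad_tol`, `step_tol`, `f_tol`), and their default values are assumptions for illustration.

```python
def f(x):
    # Illustrative function with its minimum at x = 2 (an assumption)
    return (x - 2) ** 2


def f_prime(x):
    # Derivative of f, worked out by hand
    return 2 * (x - 2)


def gradient_descent(x0, lr=0.1, grad_tol=1e-6, step_tol=1e-9,
                     f_tol=1e-12, max_iters=1000):
    """Return (x, iterations used, reason for stopping)."""
    x = x0
    for n in range(max_iters):
        grad = f_prime(x)
        if abs(grad) < grad_tol:
            return x, n, "small gradient"
        x_new = x - lr * grad
        if abs(x_new - x) < step_tol:
            return x_new, n + 1, "small parameter change"
        if abs(f(x_new) - f(x)) < f_tol:
            return x_new, n + 1, "small change in function value"
        x = x_new
    return x, max_iters, "iteration budget reached"


x, n, reason = gradient_descent(10.0)
print(f"stopped at x = {x:.6f} after {n} iterations: {reason}")
```

Which criterion fires first depends on the tolerances and the function, so returning the reason alongside the result makes it easier to judge whether the run actually converged or simply ran out of budget.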
How do we choose the starting point $x_0$?
In many optimisation problems, we do not know where the minimum is, so we simply choose a starting point at random. This is especially common in high-dimensional models.
For simple convex functions, any reasonable starting point will eventually lead to the same global minimum, provided the learning rate is suitable. For more complex functions with multiple local minima, the choice of $x_0$ can affect which minimum we find.
In deep learning, initialising parameters randomly and running gradient-based optimisation multiple times is standard practice. Careful initialisation can sometimes improve convergence speed and stability.
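To illustrate how the starting point can decide which minimum we reach, here is a small sketch on a function with two minima. The function f(x) = x**4 - 2 * x**2, the learning rate, and the two starting points are assumptions chosen for this example.

```python
def f_prime(x):
    # Derivative of the illustrative non-convex function f(x) = x**4 - 2 * x**2,
    # which has two minima, at x = -1 and x = 1 (an assumption for this example)
    return 4 * x ** 3 - 4 * x


def run(x0, lr=0.05, iters=200):
    """Run gradient descent from x0 and return the final point."""
    x = x0
    for _ in range(iters):
        x = x - lr * f_prime(x)
    return x


for x0 in [-0.5, 0.5]:
    print(f"start at {x0:+.1f} -> ends near {run(x0):+.4f}")
```

Both runs use the same function and the same learning rate; only the starting point differs, and each run settles into the minimum on its own side of x = 0.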
Why gradient descent matters in deep learning
Gradient descent is essential in modern machine learning and deep learning because:
It is often extremely hard or impossible to derive and solve the exact equations for the minima of complex models.
Models can have millions or even billions of parameters, which makes closed-form optimisation impractical.
Gradient-based methods scale well with data size and model complexity, especially with automatic differentiation and hardware acceleration.
Rather than solving a huge system of equations, we repeatedly apply a simple update rule that nudges the parameters in a better direction. This idea underlies almost every training loop you see in practice.
Quick quiz: test your understanding
Use this quick quiz to check your intuition about gradient descent.
1. When does gradient descent stop iterating?
- a) When $x_n$ is small enough.
- b) When $x_n$ is close to the initial value $x_0$.
- c) When the derivative is approximately zero or another stopping criterion is met.
Answer: In practice, we stop when the gradient is close to zero, the parameter changes are tiny, or we reach a maximum number of iterations. Option C captures this idea.
2. How do we choose the initial point $x_0$?
- a) We always pick it randomly.
- b) We always take it in the neighbourhood of the true minimum.
- c) We often pick it randomly, but the best choice can depend on the problem.
Answer: In many machine learning problems, we initialise parameters randomly because we do not know where the minimum is. However, domain knowledge or better initialisation schemes can sometimes improve convergence, so option C is the most accurate.
3. Why do we need gradient descent in deep learning?
- a) Because computers are not powerful enough to calculate derivatives.
- b) Because it is extremely hard to derive and solve exact formulas for the minima of deep learning models.
- c) Because functions always have more than one local minimum.
Answer: Automatic differentiation makes computing derivatives feasible, but analytically solving for the minima of deep networks is not. Gradient descent and its variants give us a practical way to optimise such models, so option B is correct.
Simple pseudocode example
Here is a minimal pseudocode outline of gradient descent for a one-dimensional function. In a real project, you would use a library to compute gradients, but the logic is the same.
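# Gradient descent in 1D
def f_prime(x):
    # Derivative of f(x) = (x - 3)**2, an example function assumed for illustration
    return 2 * (x - 3)

x0 = 0.0          # initial guess (assumed for illustration)
x = x0
lr = 0.1          # learning rate
max_iters = 1000

for n in range(max_iters):
    grad = f_prime(x)      # compute derivative at current x
    if abs(grad) < 1e-6:
        break              # stopping criterion: slope is nearly zero
    x = x - lr * grad      # update step

print("Approximate minimum at x =", x)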
In practice, we replace f_prime with an automatic differentiation call and extend the variable x to a vector of parameters. The structure of the loop and the update rule stays the same in both the one-dimensional and multi-dimensional settings.
Next steps
If you found this explanation helpful and want to go deeper into optimisation, neural networks, and real-world projects, consider joining the Data Science and AI bootcamp at Code Labs Academy.
You can also explore more articles and guides on the Code Labs Academy blog to strengthen your foundations in machine learning, programming, and data skills.