Gradient Descent Explained for Machine Learning Beginners

Updated on December 10, 2025 9 minutes read


In 2026, gradient descent is still one of the core algorithms used to train machine learning and deep learning models. It is the workhorse that helps us minimise loss functions when an exact analytical solution is hard or impossible to obtain.

Imagine we have a function $f(x)$, and we want to find a value of $x$ that minimises it. In simple cases, we can take the derivative, set $f'(x) = 0$, and solve for $x$. In many realistic models, especially deep neural networks, computing and solving those equations exactly is not feasible.

We therefore look for an iterative method that can walk us down the curve of the function until we get close to a minimum. Gradient descent is exactly that: a simple rule that repeatedly updates our current guess using information from the slope of the function.

Why not just solve the derivative?

For small textbook functions, we can usually compute derivatives symbolically and solve $f'(x) = 0$. Real-world machine learning models, however, can have millions of parameters and highly non-linear structures.

Even if automatic differentiation can compute derivatives numerically, solving the equation $f'(x) = 0$ in closed form is still intractable. Instead, we use gradient descent to move step by step in the direction that most reduces the function value.

This idea generalises nicely. Once we understand it in one dimension, the same logic extends to many dimensions, where the gradient becomes a vector of partial derivatives.

Building intuition in one dimension

To build intuition, start with a one-dimensional function $f$ and its graph. Suppose we pick a starting point $x_0$ somewhere on this curve.

Our goal is to move $x_0$ closer and closer to the point $x^*$ where the function reaches a local minimum. At that point, the slope is zero, so $f'(x^*) = 0$.

Two questions naturally arise when we try to move toward $x^*$:

  • In which direction should we move the current point $x_n$? Left or right?
  • How large should each step be so that we move efficiently without overshooting?

Direction: Should we move left or right?

The sign of the derivative tells us in which direction the function is increasing. The derivative at a point $x_0$ is the slope of the tangent line to the curve at that point.

If the slope (derivative) is positive, the tangent line goes up as we move to the right.

If the slope (derivative) is negative, the tangent line goes down as we move to the right.

For a typical U-shaped function with a minimum at $x^*$:

When $x_0$ is to the right of $x^*$, the slope is positive, so $f'(x_0) > 0$.

When $x_0$ is to the left of $x^*$, the slope is negative, so $f'(x_0) < 0$.

This means:

If $f'(x_0) > 0$, we should move $x_0$ to the left to go downhill.

If $f'(x_0) < 0$, we should move $x_0$ to the right to go downhill.

The key insight is that the sign of the derivative alone already tells us which direction to move to reduce the function value.
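The direction rule can be sketched in a few lines of Python. The function $f(x) = (x - 3)^2$ and the helper name `downhill_direction` are assumptions made for this illustration, not something defined in the article:

```python
def f_prime(x):
    # Derivative of the illustrative function f(x) = (x - 3)**2,
    # which has its minimum at x = 3 (an assumption for this sketch)
    return 2 * (x - 3)

def downhill_direction(x):
    """Return the direction that reduces f, based only on the sign of f'(x)."""
    slope = f_prime(x)
    if slope > 0:
        return "left"    # function rises to the right, so go left
    if slope < 0:
        return "right"   # function falls to the right, so go right
    return "stay"        # slope is zero: we are at a stationary point

for x0 in (5.0, 1.0, 3.0):
    print(f"x0 = {x0}: f'(x0) = {f_prime(x0)}, move {downhill_direction(x0)}")
```

Note that the helper never looks at the value of the function itself, only at the sign of its derivative.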

The tangent line and local slope

The tangent line at a point $x_0$ is a linear approximation of the function $f$ near that point. Its equation is

$$\text{tangent}_{x_0}(x) = f'(x_0)\,(x - x_0) + f(x_0).$$

The coefficient $f'(x_0)$ is exactly the slope of the tangent. This slope is what gradient descent uses to decide how to update our current guess.

Knowing the derivative at a point therefore gives us local information about the direction to move and, roughly, how far away the minimum might be.
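To see what "linear approximation near that point" means in numbers, here is a small sketch. It again assumes the illustrative function $f(x) = (x - 3)^2$, and the helper `tangent` is a hypothetical name for the tangent-line formula:

```python
def f(x):
    # Illustrative function (an assumption for this sketch)
    return (x - 3) ** 2

def f_prime(x):
    # Its derivative
    return 2 * (x - 3)

def tangent(x0, x):
    """Value at x of the tangent line to f at x0."""
    return f_prime(x0) * (x - x0) + f(x0)

x0 = 5.0
for x in (5.0, 5.1, 6.0):
    error = abs(f(x) - tangent(x0, x))
    print(f"x = {x}: f(x) = {f(x):.3f}, tangent = {tangent(x0, x):.3f}, error = {error:.3f}")
```

The tangent matches the function exactly at $x_0$, and the approximation error grows as we move away from it. This is one reason the derivative should be trusted only for small moves.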

Step size: how far should we move?

The magnitude of the derivative $\lvert f'(x_0) \rvert$ reflects how steep the curve is at $x_0$. The steeper the curve, the larger the absolute value of the slope.

If $x_0$ is close to $x^*$, the slope is small, so $\lvert f'(x_0) \rvert$ is small.

If $x_0$ is far from $x^*$, the slope is larger, so $\lvert f'(x_0) \rvert$ tends to be larger.

Intuitively, when the slope is large, we can afford to take a bigger step, because we are far away from the minimum. As we get closer, we want smaller steps to avoid bouncing around the minimum.

Gradient descent formalises this idea by combining the derivative with a parameter called the learning rate.
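As a rough illustration of how the slope scales the step, the sketch below multiplies $\lvert f'(x) \rvert$ by a learning rate at points that are far from, and close to, the minimum. The function $f(x) = (x - 3)^2$, the value `lr = 0.1`, and the helper `step_size` are all assumptions for this example. For this particular quadratic the slope grows with the distance to the minimum, which is not guaranteed for every function:

```python
def f_prime(x):
    # Derivative of the illustrative function f(x) = (x - 3)**2
    return 2 * (x - 3)

def step_size(x, lr):
    """Length of the move produced by a step of lr * f'(x)."""
    return lr * abs(f_prime(x))

lr = 0.1  # hypothetical learning rate
for x in (8.0, 4.0, 3.1):
    print(f"x = {x}: |f'(x)| = {abs(f_prime(x)):.2f}, step = {step_size(x, lr):.3f}")
```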

The gradient descent update rule

One-dimensional case

Gradient descent is an iterative optimisation algorithm that updates our current estimate step by step. In one dimension, the update rule is

$$x_{n+1} = x_n - \text{lr} \cdot f'(x_n).$$

Here:

$x_n$ is our current point at iteration $n$.

$f'(x_n)$ is the derivative of $f$ evaluated at $x_n$.

$\text{lr} > 0$ is the learning rate, a number that controls how big each step is.

Notice what this rule does at each iteration:

If $f'(x_n) > 0$ (we are to the right of a local minimum for a U-shaped function), we subtract a positive quantity and move $x_{n+1}$ to the left.

If $f'(x_n) < 0$, the product $\text{lr} \cdot f'(x_n)$ is negative, so we subtract a negative number and move $x_{n+1}$ to the right.

As we get closer to a minimum, $f'(x_n)$ approaches zero, so the updates become smaller and smaller.

This simple rule gives us a powerful way to approximate a minimum without ever having to solve $f'(x) = 0$ explicitly.
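As a worked illustration, assume $f(x) = (x - 3)^2$, so that $f'(x) = 2(x - 3)$, with a starting point $x_0 = 0$ and $\text{lr} = 0.1$. These values are chosen only for this sketch. The first three updates would be:

```latex
\begin{aligned}
x_1 &= x_0 - 0.1 \cdot f'(x_0) = 0 - 0.1 \cdot (-6) = 0.6 \\
x_2 &= x_1 - 0.1 \cdot f'(x_1) = 0.6 - 0.1 \cdot (-4.8) = 1.08 \\
x_3 &= x_2 - 0.1 \cdot f'(x_2) = 1.08 - 0.1 \cdot (-3.84) = 1.464
\end{aligned}
```

Each step moves to the right, toward the minimum at $x = 3$, and the steps shrink (0.6, then 0.48, then 0.384) because the slope gets flatter as we approach it.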

From one dimension to many

In higher dimensions, we move from a scalar $x$ to a parameter vector $\mathbf{w}$. The derivative generalises to the gradient, written $\nabla f(\mathbf{w})$, which is a vector of partial derivatives.

The update rule becomes

$$\mathbf{w}_{n+1} = \mathbf{w}_n - \text{lr} \cdot \nabla f(\mathbf{w}_n).$$

The same intuition applies in many dimensions:

The gradient points in the direction of the steepest increase of the function. The negative gradient $-\nabla f(\mathbf{w}_n)$ points in the direction of steepest decrease, which is where we want to go.

This is exactly how modern deep learning models update millions of parameters during training.
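A minimal sketch of the vector update, using plain Python lists so that it runs without extra libraries. The two-parameter function $f(\mathbf{w}) = (w_0 - 1)^2 + 2(w_1 + 2)^2$, its hand-written gradient, and the settings `lr=0.1` and `n_steps=200` are assumptions for illustration. A real model would get its gradient from automatic differentiation:

```python
def grad_f(w):
    # Gradient of the illustrative function
    # f(w) = (w[0] - 1)**2 + 2 * (w[1] + 2)**2, one partial derivative per parameter
    return [2 * (w[0] - 1), 4 * (w[1] + 2)]

def gradient_descent(grad, w0, lr=0.1, n_steps=200):
    """Repeatedly apply w <- w - lr * grad(w) and return the final vector."""
    w = list(w0)
    for _ in range(n_steps):
        g = grad(w)
        # Same rule as in 1D, applied to every coordinate at once
        w = [w_i - lr * g_i for w_i, g_i in zip(w, g)]
    return w

w_final = gradient_descent(grad_f, [0.0, 0.0])
print(w_final)  # should be close to the minimiser [1, -2]
```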

Choosing a learning rate

The learning rate $\text{lr}$ is one of the most important hyperparameters in gradient descent. It controls how aggressive or cautious each update is.

If $\text{lr}$ is too large, the algorithm can overshoot the minimum and diverge.

If $\text{lr}$ is too small, gradient descent will move very slowly and may take a long time to converge.

In practice, we often start with a moderate learning rate and adjust it based on how the training behaves. Many optimisers in deep learning libraries automatically adapt the effective step size during training.

A common workflow is to try a few candidate learning rates, watch how the loss curve evolves, and then refine the choice based on empirical behaviour.
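The effect of the learning rate can be seen on a toy problem. The sketch below assumes $f(x) = x^2$ (minimum at $x = 0$), a starting point of 1.0, and three hypothetical candidate rates. It is meant only to show the qualitative behaviour, not to recommend values:

```python
def f_prime(x):
    # Derivative of the illustrative function f(x) = x**2
    return 2 * x

def run(lr, x0=1.0, n_steps=20):
    """Run a fixed number of gradient descent steps and return the final x."""
    x = x0
    for _ in range(n_steps):
        x = x - lr * f_prime(x)
    return x

# Try a too-small, a moderate, and a too-large learning rate
final = {lr: run(lr) for lr in (0.01, 0.1, 1.1)}
for lr, x in final.items():
    print(f"lr = {lr}: x after 20 steps = {x:.4f}")
```

For this function, each step multiplies $x$ by $(1 - 2\,\text{lr})$. With 0.01 the iterate creeps toward zero, with 0.1 it gets there much faster, and with 1.1 the factor is $-1.2$, so the iterate flips sign and grows at every step.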

When does gradient descent stop?

In theory, gradient descent could continue iterating forever, updating $x_n$ again and again. In practice, we need a stopping rule that tells us when we are close enough to a minimum.

Common stopping criteria include:

The derivative or gradient becomes very small, for example $\lvert f'(x_n) \rvert \approx 0$.

The change in the parameters is very small, such as $\lvert x_{n+1} - x_n \rvert$ below a chosen threshold.

The change in the function value $f(x_n)$ from one step to the next is tiny.

A maximum number of iterations or passes through the data (epochs) is reached.

In other words, gradient descent stops when further updates are unlikely to improve the solution significantly, or when we hit a time or compute budget.
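The criteria above can be combined in a single loop. This is a sketch under assumptions: the tolerance values, the parameter names, and the test function $f(x) = (x - 3)^2$ are illustrative choices, and real training loops usually rely on only one or two of these checks:

```python
def gradient_descent(f, f_prime, x0, lr=0.1, grad_tol=1e-6,
                     step_tol=1e-9, f_tol=1e-12, max_iters=1000):
    """Return (x, reason, iterations) for 1D gradient descent with several stopping rules."""
    x = x0
    for n in range(max_iters):
        grad = f_prime(x)
        if abs(grad) < grad_tol:                 # gradient is nearly zero
            return x, "small gradient", n
        x_new = x - lr * grad
        if abs(x_new - x) < step_tol:            # parameters barely moved
            return x_new, "small parameter change", n + 1
        if abs(f(x_new) - f(x)) < f_tol:         # function value barely changed
            return x_new, "small change in function value", n + 1
        x = x_new
    return x, "max iterations reached", max_iters  # iteration budget exhausted

f = lambda x: (x - 3) ** 2
f_prime = lambda x: 2 * (x - 3)

print(gradient_descent(f, f_prime, x0=0.0))
print(gradient_descent(f, f_prime, x0=0.0, max_iters=5))
```

Returning the reason alongside the result makes it easy to tell a genuine convergence from a run that simply ran out of iterations.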

How do we choose the starting point $x_0$?

In many optimisation problems, we do not know where the minimum is, so we simply choose a starting point $x_0$ at random. This is especially common in high-dimensional models.

For simple convex functions, any reasonable starting point will eventually lead to the same global minimum. For more complex functions with multiple local minima, the choice of $x_0$ can affect which minimum we find.

In deep learning, initialising parameters randomly and running gradient-based optimisation multiple times is standard practice. Careful initialisation can sometimes improve convergence speed and stability.
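To illustrate how the starting point can decide the outcome, the sketch below assumes the non-convex function $f(x) = x^4 - 2x^2$, which has two local minima at $x = -1$ and $x = 1$. The function, the learning rate, and the starting points are all choices made for this example:

```python
def f_prime(x):
    # Derivative of the illustrative non-convex function f(x) = x**4 - 2 * x**2
    return 4 * x**3 - 4 * x

def descend(x0, lr=0.05, n_steps=200):
    """Run plain gradient descent from x0 and return the final point."""
    x = x0
    for _ in range(n_steps):
        x = x - lr * f_prime(x)
    return x

for x0 in (-1.5, -0.5, 0.5, 1.5):
    print(f"start at {x0}: ends near {descend(x0):.4f}")
```

Starting points on the left end up near $-1$, and those on the right end up near $1$. Starting exactly at $x = 0$ would not move at all, because the derivative there is zero even though it is not a minimum.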

Why gradient descent matters in deep learning

Gradient descent is essential in modern machine learning and deep learning because:

It is often extremely hard or impossible to derive and solve the exact equations for the minima of complex models.

Models can have millions or even billions of parameters, which makes closed-form optimisation impractical.

Gradient-based methods scale well with data size and model complexity, especially with automatic differentiation and hardware acceleration.

Rather than solving a huge system of equations, we repeatedly apply a simple update rule that nudges the parameters in a better direction. This idea underlies almost every training loop you see in practice.

Quick quiz: test your understanding

Use this quick quiz to check your intuition about gradient descent.

1. When does gradient descent stop iterating?

  • a) When $x_n$ is small enough.
  • b) When $x_n$ is close to the initial value $x_0$.
  • c) When the derivative $f'(x_n)$ is approximately zero or another stopping criterion is met.

Answer: In practice, we stop when the gradient is close to zero, the parameter changes are tiny, or we reach a maximum number of iterations. Option C captures this idea.

2. How do we choose the initial point $x_0$?

  • a) We always pick it randomly.
  • b) We always take it in the neighbourhood of the true minimum.
  • c) We often pick it randomly, but the best choice can depend on the problem.

Answer: In many machine learning problems, we initialise parameters randomly because we do not know where the minimum is. However, domain knowledge or better initialisation schemes can sometimes improve convergence, so option C is the most accurate.

3. Why do we need gradient descent in deep learning?

  • a) Because computers are not powerful enough to calculate derivatives.
  • b) Because it is extremely hard to derive and solve exact formulas for the minima of deep learning models.
  • c) Because functions always have more than one local minimum.

Answer: Automatic differentiation makes computing derivatives feasible, but analytically solving for the minima of deep networks is not. Gradient descent and its variants give us a practical way to optimise such models, so option B is correct.

Simple code example

Here is a minimal outline of gradient descent for a one-dimensional function, written as runnable Python. The derivative f_prime below belongs to an example function chosen purely for illustration. In a real project, you would use a library to compute gradients, but the logic is the same.

# Gradient descent in 1D
def f_prime(x):
    # Example derivative (illustrative choice): f(x) = (x - 3)**2
    return 2 * (x - 3)

x = 0.0                 # initial guess x0
lr = 0.1                # learning rate
max_iters = 1000

for n in range(max_iters):
    grad = f_prime(x)   # compute derivative at current x
    if abs(grad) < 1e-6:
        break           # stopping criterion: slope is nearly zero
    x = x - lr * grad   # update step

print("Approximate minimum at x =", x)

In practice, we replace f_prime with an automatic differentiation call and extend the variable x to a vector of parameters. The structure of the loop and the update rule stays the same in both the one-dimensional and multi-dimensional settings.

Next steps

If you found this explanation helpful and want to go deeper into optimisation, neural networks, and real-world projects, consider joining the Data Science and AI bootcamp at Code Labs Academy.

You can also explore more articles and guides on the Code Labs Academy blog to strengthen your foundations in machine learning, programming, and data skills.

Frequently Asked Questions

What is gradient descent in simple terms?

Gradient descent is an iterative optimisation algorithm that repeatedly moves parameters in the direction that most reduces a loss function. It uses the derivative (or gradient) to decide which way to move and how big each step should be.

How is gradient descent used in deep learning?

Deep learning models define a loss function that measures how well the model fits the data. During training, gradient-based optimisers compute gradients of this loss with respect to the model parameters and apply update rules like gradient descent or Adam to reduce the loss over many iterations.

What is a good learning rate for gradient descent?

There is no single best learning rate. If it is too large, the algorithm can diverge or oscillate; if it is too small, training will be very slow. In practice, we choose a reasonable starting value, monitor training behaviour, and adjust or schedule the learning rate based on experiments.
