
How does gradient descent work?


This is the companion website for the paper Understanding Optimization in Deep Learning with Central Flows, published at ICLR 2025.

Part I: how does gradient descent work?

The simplest optimization algorithm is deterministic gradient descent:

\[ w_{t+1} = w_t - \eta \, \nabla L(w_t) \]
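As a concrete illustration (a minimal sketch, not taken from the paper), the update rule can be written as a short Python loop; `grad_fn` is a hypothetical placeholder for \( \nabla L \):

```python
import numpy as np

def gradient_descent(w0, grad_fn, lr, num_steps):
    """Deterministic gradient descent: w_{t+1} = w_t - lr * grad L(w_t)."""
    w = np.asarray(w0, dtype=float)
    trajectory = [w.copy()]
    for _ in range(num_steps):
        w = w - lr * grad_fn(w)   # the update rule above
        trajectory.append(w.copy())
    return trajectory
```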

Perhaps surprisingly, traditional analyses of gradient descent cannot capture its typical dynamics in deep learning. We'll first explain why, and then present a new analysis of gradient descent that does apply in deep learning.

The dynamics of gradient descent

Let's start with the picture that everyone has likely seen before. Suppose that we run gradient descent on a quadratic function \( \frac{1}{2} S x^2\), i.e. a smiley-face parabola. The parameter \(S\) controls the second derivative ("curvature") of the parabola: when \(S\) is larger, the parabola is steeper.
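For reference, here is a small sketch of this toy objective and its gradient in Python (the function names are our own, not from the paper):

```python
def quadratic_loss(x, S):
    """L(x) = 0.5 * S * x^2, a parabola with second derivative ("curvature") S."""
    return 0.5 * S * x**2

def quadratic_grad(x, S):
    """L'(x) = S * x."""
    return S * x
```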

If we run gradient descent on this function with learning rate \(\eta\), there are two possible outcomes. On the one hand, if \(S < 2/\eta\), then the parabola is "flat enough" for the learning rate \(\eta\), and gradient descent will converge. On the other hand, if \(S > 2/\eta\), then the parabola is "too sharp" for the learning rate \(\eta\), and gradient descent will oscillate back and forth with increasing magnitude.
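To see both regimes numerically, here is an illustrative sketch (the specific values are ours, not from the paper): with \(\eta = 0.1\) the threshold is \(2/\eta = 20\), so \(S = 5\) converges while \(S = 25\) oscillates with growing magnitude.

```python
eta = 0.1                            # learning rate; stability threshold is 2/eta = 20
for S in (5.0, 25.0):                # S < 20: "flat enough"; S > 20: "too sharp"
    w, iterates = 1.0, [1.0]
    for _ in range(10):
        w -= eta * S * w             # gradient of 0.5*S*w^2 is S*w
        iterates.append(round(w, 3))
    print(f"S = {S}: {iterates}")
```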

