Tech News

The quadratic sandwich

Why This Matters

The quadratic sandwich combines strong convexity and L-smoothness, two properties that govern how efficiently gradient-based methods can minimize a function in machine learning. They determine how well those methods perform, which shapes both algorithm design and convergence guarantees. Knowing how to recognize and verify them leads to better optimization strategies and more reliable model training.


If you have ever tried to minimize a function with gradient descent, you probably noticed that some functions are a joy to optimize and others are a nightmare. The difference often boils down to two properties: strong convexity and L-smoothness. These two concepts define a “sandwich” of quadratic bounds around your function that tells you exactly how well-behaved it is. If the sandwich is tight, life is good. If one slice of bread is missing, things get ugly fast.

In this post we’ll build up both concepts from scratch, see how they combine into the quadratic sandwich, understand what happens at the level of the Hessian’s eigenvalues, and pick up a neat trick to verify L-smoothness without ever computing an eigenvalue.

A differentiable function \(f:\mathbb{R}^n\to\mathbb{R}\) is \(\mu\)-strongly convex (with \(\mu > 0\)) if for all \(x, y\)

\[ f(y) \ge f(x) + \nabla f(x)^\top (y - x) + \frac{\mu}{2}\|y - x\|^2. \]

If this looks familiar, it’s because the first two terms on the right are the first-order Taylor expansion of \(f\) at \(x\). For a plain convex function, the Taylor expansion is already a global underestimator (that’s the subgradient inequality). But strong convexity asks for more: the function must stay above the tangent plus a quadratic gap. The parameter \(\mu\) controls how aggressive this gap is — the bigger \(\mu\), the more the function curves upward and away from its linear approximation.

The intuition is that a strongly convex function has a guaranteed minimum curvature of \(\mu\) in every direction. It can’t flatten out, it can’t plateau, it can’t have a degenerate valley where one direction is basically flat. There is always a force pulling you toward the minimum, and that force grows at least linearly with the distance from the minimizer: \(\|\nabla f(x)\| \ge \mu\,\|x - x^\star\|\), where \(x^\star\) is the minimizer.
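To make the lower bound concrete, here is a minimal numerical sketch. It assumes a hypothetical diagonal quadratic \(f(x) = \tfrac{1}{2}(A x_1^2 + B x_2^2)\) with \(A = 1\) and \(B = 10\), which is not from the original post. It checks the strong convexity inequality \(f(y) \ge f(x) + \nabla f(x)^\top (y - x) + \tfrac{\mu}{2}\|y - x\|^2\) on random pairs of points. The function names are illustrative only, and random sampling can refute a candidate \(\mu\) but never prove one.

```python
import random

# Hypothetical example function: f(x) = 0.5 * (A * x1^2 + B * x2^2).
# Its smallest curvature is min(A, B), so mu = 1.0 should work here.
A, B = 1.0, 10.0

def f(x):
    return 0.5 * (A * x[0] ** 2 + B * x[1] ** 2)

def grad_f(x):
    return (A * x[0], B * x[1])

def strong_convexity_gap(x, y, mu):
    """Return f(y) minus the quadratic lower bound built at x.

    The bound holds for this pair exactly when the result is >= 0.
    """
    g = grad_f(x)
    linear = g[0] * (y[0] - x[0]) + g[1] * (y[1] - x[1])
    sq_dist = (y[0] - x[0]) ** 2 + (y[1] - x[1]) ** 2
    return f(y) - (f(x) + linear + 0.5 * mu * sq_dist)

def holds_on_samples(mu, n_samples=1000, seed=0, tol=1e-9):
    """Check the inequality on random pairs; a failure disproves mu."""
    rng = random.Random(seed)
    for _ in range(n_samples):
        x = (rng.uniform(-5, 5), rng.uniform(-5, 5))
        y = (rng.uniform(-5, 5), rng.uniform(-5, 5))
        if strong_convexity_gap(x, y, mu) < -tol:
            return False
    return True
```

With these assumed values, `holds_on_samples(1.0)` passes, while any \(\mu\) above the smallest curvature fails along the flat direction: `strong_convexity_gap((0.0, 0.0), (1.0, 0.0), 2.0)` is negative, because along the first axis the function only curves with strength 1.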

A differentiable function \(f\) is \(L\)-smooth if its gradient is Lipschitz continuous:

\[ \|\nabla f(x) - \nabla f(y)\| \le L\,\|x - y\| \quad \text{for all } x, y. \]

Read this carefully: the change in the gradient between any two points is always dominated by a rescaled version of the change in the input. No matter how far apart \(x\) and \(y\) are, the gradient difference \(\|\nabla f(x) - \nabla f(y)\|\) can never outpace \(L\) times the input difference \(\|x - y\|\). The constant \(L\) acts as a leash on the gradient: it can move, but it can’t jerk. No abrupt turns, no sudden spikes in curvature.
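The leash can be probed numerically by measuring the ratio \(\|\nabla f(x) - \nabla f(y)\| / \|x - y\|\) on sampled pairs. The sketch below again assumes a hypothetical diagonal quadratic \(f(x) = \tfrac{1}{2}(A x_1^2 + B x_2^2)\) with \(A = 1\) and \(B = 10\), for which \(L = \max(A, B) = 10\). The largest sampled ratio is only a lower estimate of \(L\); it does not certify smoothness.

```python
import math
import random

# Hypothetical curvatures of f(x) = 0.5 * (A * x1^2 + B * x2^2).
A, B = 1.0, 10.0

def grad_f(x):
    return (A * x[0], B * x[1])

def gradient_ratio(x, y):
    """Return ||grad f(x) - grad f(y)|| / ||x - y|| for two distinct points."""
    gx, gy = grad_f(x), grad_f(y)
    grad_diff = math.hypot(gx[0] - gy[0], gx[1] - gy[1])
    input_diff = math.hypot(x[0] - y[0], x[1] - y[1])
    return grad_diff / input_diff

def largest_sampled_ratio(n_samples=1000, seed=0):
    """Largest ratio seen over random pairs: a lower estimate of L."""
    rng = random.Random(seed)
    best = 0.0
    for _ in range(n_samples):
        x = (rng.uniform(-5, 5), rng.uniform(-5, 5))
        y = (rng.uniform(-5, 5), rng.uniform(-5, 5))
        if x == y:
            continue  # the ratio is undefined for identical points
        best = max(best, gradient_ratio(x, y))
    return best
```

For this example the ratio equals 10 when the two points differ only along the second axis and 1 when they differ only along the first, so every sampled ratio falls between those two values, up to floating-point rounding.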
