
Hamilton-Jacobi-Bellman Equation: Reinforcement Learning and Diffusion Models

Why This Matters

This article highlights the importance of the Hamilton-Jacobi-Bellman (HJB) equation in advancing continuous-time reinforcement learning and generative diffusion models. Understanding this connection enables the development of more sophisticated control algorithms and improves the training of AI models, impacting both industry applications and consumer technologies.

Key Takeaways

This post explains why the HJB equation is Bellman's equation in continuous time, why continuous time matters, and how to solve the resulting control problem with neural policy iteration.

Machine learning feels recent, but one of its core mathematical ideas dates back to 1952, when Richard Bellman published a seminal paper titled “On the Theory of Dynamic Programming” [6, 7], laying the foundation for optimal control and what we now call reinforcement learning.

Later in the 1950s, Bellman extended this work to continuous-time systems, turning the optimality condition into a partial differential equation (PDE). He then discovered that this PDE was essentially identical to a result published in physics a century earlier, in the 1830s: the Hamilton-Jacobi equation.

Once that structure is visible, several topics line up naturally:

continuous-time reinforcement learning

stochastic control

diffusion models

optimal transport

In this post I want to turn our attention to two applications of Bellman's work: continuous-time reinforcement learning, and how the training of generative diffusion models can be interpreted through stochastic optimal control.

Bellman originally formulated dynamic programming in discrete time in the early 1950s [6, 7]. Consider a Markov decision process with state space $\mathcal X$, action space $\mathcal A$, transition kernel $P(\cdot\mid x,a)$, reward function $r(x,a)$, and discount factor $\gamma\in(0,1)$. A policy $\pi$ maps each state to a distribution over actions. If the state evolves as a controlled Markov chain

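Before moving to continuous time, it may help to see the discrete-time Bellman optimality equation in action. The sketch below runs value iteration on a toy two-state, two-action MDP; the transition matrices and rewards are made-up numbers for illustration, not from the article. It repeatedly applies the Bellman optimality operator $(Tv)(x) = \max_a \big[ r(x,a) + \gamma \sum_{x'} P(x' \mid x, a)\, v(x') \big]$ until $v$ stops changing, then reads off a greedy policy.

```python
import numpy as np

# Toy 2-state, 2-action MDP (hypothetical numbers, for illustration only).
# P[a, x, x'] = probability of moving from state x to x' under action a.
P = np.array([
    [[0.9, 0.1],   # action 0
     [0.2, 0.8]],
    [[0.5, 0.5],   # action 1
     [0.1, 0.9]],
])
# r[x, a] = reward for taking action a in state x.
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9  # discount factor, gamma in (0, 1)

# Value iteration: apply the Bellman optimality operator to a fixed point.
v = np.zeros(2)
for _ in range(1000):
    # Q(x, a) = r(x, a) + gamma * sum_{x'} P(x' | x, a) * v(x')
    q = r + gamma * np.einsum('axy,y->xa', P, v)
    v_new = q.max(axis=1)
    if np.max(np.abs(v_new - v)) < 1e-10:
        v = v_new
        break
    v = v_new

policy = q.argmax(axis=1)  # greedy (deterministic) policy from Q
print("optimal values:", v)
print("greedy policy:", policy)
```

Because the operator is a $\gamma$-contraction, the iteration converges geometrically from any starting $v$; the continuous-time story in the rest of the article replaces this fixed-point condition with the HJB PDE.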