
Hamilton-Jacobi-Bellman Equation: Reinforcement Learning and Diffusion Models

Why This Matters

This article highlights the importance of the Hamilton-Jacobi-Bellman (HJB) equation in advancing continuous-time reinforcement learning and generative diffusion models. Understanding this connection enables the development of more sophisticated control algorithms and improves the training of AI models, impacting both industry applications and consumer technologies.

Key Takeaways

This post explains why the HJB equation is Bellman's equation in continuous time, why continuous time matters, and how to solve the resulting control problem with neural policy iteration.

Machine learning feels recent, but one of its core mathematical ideas dates back to 1952, when Richard Bellman published a seminal paper titled “On the Theory of Dynamic Programming” [6, 7], laying the foundation for optimal control and what we now call reinforcement learning.

Later in the 1950s, Bellman extended this work to continuous-time systems, turning the optimality condition into a partial differential equation (PDE). He then discovered that this PDE was essentially identical to a result published in physics a century earlier, in the 1830s: the Hamilton-Jacobi equation.

Once that structure is visible, several topics line up naturally:

continuous-time reinforcement learning

stochastic control

diffusion models

optimal transport

In this post I want to turn our attention to two applications of Bellman's work: continuous-time reinforcement learning, and how the training of generative diffusion models can be interpreted through stochastic optimal control.

Bellman originally formulated dynamic programming in discrete time in the early 1950s [6, 7]. Consider a Markov decision process with state space $\mathcal X$, action space $\mathcal A$, transition kernel $P(\cdot\mid x,a)$, reward function $r(x,a)$, and discount factor $\gamma\in(0,1)$. A policy $\pi$ maps each state to a distribution over actions. If the state evolves as a controlled Markov chain

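Before moving to continuous time, it may help to see the discrete-time Bellman optimality equation in action. The sketch below runs value iteration on a toy two-state, two-action MDP; the transition matrices and rewards are made-up numbers for illustration, not from the article. It repeatedly applies the Bellman optimality operator $(Tv)(x) = \max_a \big[ r(x,a) + \gamma \sum_{x'} P(x' \mid x, a)\, v(x') \big]$ until $v$ stops changing, then reads off a greedy policy.

```python
import numpy as np

# Toy 2-state, 2-action MDP (hypothetical numbers, for illustration only).
# P[a, x, x'] = probability of moving from state x to x' under action a.
P = np.array([
    [[0.9, 0.1],   # action 0
     [0.2, 0.8]],
    [[0.5, 0.5],   # action 1
     [0.1, 0.9]],
])
# r[x, a] = reward for taking action a in state x.
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9  # discount factor, gamma in (0, 1)

# Value iteration: apply the Bellman optimality operator to a fixed point.
v = np.zeros(2)
for _ in range(1000):
    # Q(x, a) = r(x, a) + gamma * sum_{x'} P(x' | x, a) * v(x')
    q = r + gamma * np.einsum('axy,y->xa', P, v)
    v_new = q.max(axis=1)
    if np.max(np.abs(v_new - v)) < 1e-10:
        v = v_new
        break
    v = v_new

policy = q.argmax(axis=1)  # greedy (deterministic) policy from Q
print("optimal values:", v)
print("greedy policy:", policy)
```

Because the operator is a $\gamma$-contraction, the iteration converges geometrically from any starting $v$; the continuous-time story in the rest of the article replaces this fixed-point condition with the HJB PDE.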