Deriving Muon
Published on: 2025-06-26 12:43:09
Particle tracks in a bubble chamber. Fermilab.
We recently proposed Muon: a new neural net optimizer. Muon has garnered attention for its excellent practical performance: it was used to set NanoGPT speed records, which led to interest from the big labs.
What makes Muon particularly special to me is that we derived the core numerical methods from an exact theoretical principle. This is in contrast to popular optimizers like Adam, which have more heuristic origins and often converge more slowly than Muon. In this post, I will walk through a derivation of Muon. I hope this will provide context that may help researchers extend the methods to new layer types and beyond.
While this post focuses on the theory behind Muon, I recommend checking out Keller’s post to learn more about the algorithm—including the substantial ingenuity that went into making the implementation run fast.
📘 Muon is joint work with Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista and Laker N