
Modular Manifolds



When we train large neural networks, we need to keep them healthy. We do not want the tensors in the network—either the weights, activations or gradients—to grow too large or too small. Very small and very large tensors cause a variety of problems, not limited to numerical underflow and overflow. For example, weight matrices changing size during training makes it harder to design training algorithms, since the relative size of an update to a weight matrix has a significant impact on the speed of learning.

The gold standard for keeping tensors healthy is to normalize them. Normalization is commonplace for activation vectors, where we use techniques like layer norm to put the activations on a good scale before passing them to the next layer. It is also commonplace to normalize gradient updates, where we can interpret fast training algorithms like the Muon optimizer as spectrally normalizing the updates. Normalization provides us with certainty about the sizes of tensors—without needing to check Wandb!—and when training large neural networks with many interacting components, having certainty about the network internals is valuable.
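To make these two kinds of normalization concrete, here is a minimal sketch in PyTorch (the function names are ours, and the orthogonalization step uses an exact SVD rather than the Newton-Schulz iteration that Muon uses in practice): layer norm puts an activation vector on a known scale, and mapping every singular value of an update matrix to one pins down the update's size in the spectral norm.

```python
import torch

def layer_norm(x, eps=1e-5):
    # Put an activation vector on a known scale: zero mean, unit variance.
    return (x - x.mean(-1, keepdim=True)) / (x.std(-1, keepdim=True) + eps)

def orthogonalize(update):
    # Map every singular value of the update matrix to one -- the idealized
    # form of the spectral normalization that Muon approximates.
    U, S, Vh = torch.linalg.svd(update, full_matrices=False)
    return U @ Vh
```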

Normalization is less commonly applied to weight matrices, although it is not unheard of. For example, the EDM2 diffusion model codebase uses weight constraints and the authors report benefits in their paper. Various other techniques have been proposed but are not common practice in modern large-scale training. For some examples, see Salimans et al, 2016, Miyato et al, 2018 and our paper Liu et al, 2021. Normalizing the weight matrices might be a good idea for a few reasons. Weight constraints make understanding the relative size of optimization updates easier. They remove the problem of weight norms exploding. They allow us to focus hyperparameter tuning effort on tensors whose size matters most. They can force matrices to have a small condition number, making their behaviour more predictable. And relatedly, weight constraints facilitate Lipschitz guarantees for robustness to perturbations.
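As a simple illustration of what a weight constraint can look like in code (a sketch of the general idea only, not the exact scheme used in EDM2 or the papers cited above): after every optimizer step, each weight matrix is projected back to a fixed norm, so its size cannot drift over the course of training.

```python
import torch

@torch.no_grad()
def constrain_weight(weight, target_norm=1.0, eps=1e-12):
    # Renormalize each row of a weight matrix to a fixed norm after the
    # optimizer step, so the weight's scale stays known throughout training.
    row_norms = weight.norm(dim=1, keepdim=True)
    weight.mul_(target_norm / (row_norms + eps))
    return weight
```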

This post covers one appealing way to constrain the weight matrices of a neural network—by keeping the tensors constrained to submanifolds at each layer. This opens the door to re-thinking optimization, as we can co-design optimization algorithms with these manifold constraints. As an example, we propose a manifold version of the Muon optimizer whose weights are constrained to the Stiefel manifold: the manifold of matrices with unit condition number. (This algorithm builds on work from Jianlin Su and Franz Louis Cesista, as discussed further below.) We conclude the post by defining the idea of a modular manifold, a composable manifold intended to make it easier to scale up and train large networks.

Our goal in writing this post is to provide an introduction to a research area that we are excited about, and highlight many directions for future work. We would love to see more work from the community on the topics mentioned at the end of the post!

The shape of a manifold optimizer

This section works through the simplest example of learning on a manifold: a vector parameter constrained to a hypersphere in $\mathbb{R}^d$. The vector parameter is trained to minimize a loss function defined over the full space $\mathbb{R}^d$. This setup might be useful for, say, individual embedding vectors in a transformer model. This section will be a good warmup for the following section on manifold Muon that considers matrix parameters.

We will not be too formal about the definition of a manifold here: it is enough to understand that a manifold is a curved surface that looks flat when you zoom in close enough. The locally flat approximation at a point on the manifold is called the tangent space to the manifold, as visualized in the figure below:

The sphere in three dimensions—or the hypersphere in higher dimensions—is a manifold. The locally flat approximation at a point on the manifold is called the tangent space to the manifold and is visualized as the red plane in the figure.
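To convey the overall shape of a manifold optimizer, here is a minimal sketch in PyTorch (the names and the toy quadratic loss are ours, for illustration only) of one training step for a vector constrained to the unit hypersphere: project the gradient onto the tangent space at the current point, take a step there, then retract back onto the sphere by renormalizing.

```python
import torch

def sphere_step(w, grad, lr=0.1):
    # Project the gradient onto the tangent space at w by removing its
    # radial component (w is assumed to be a unit vector).
    tangent_grad = grad - (grad @ w) * w
    # Take an ordinary gradient step within the tangent space...
    w = w - lr * tangent_grad
    # ...then retract back onto the unit hypersphere by renormalizing.
    return w / w.norm()

# Toy usage: keep an embedding vector on the hypersphere while minimizing
# a simple quadratic loss defined over the full space R^d.
w = torch.randn(8)
w = w / w.norm()
target = torch.randn(8)
for _ in range(100):
    w_leaf = w.clone().requires_grad_(True)
    loss = ((w_leaf - target) ** 2).sum()
    loss.backward()
    w = sphere_step(w, w_leaf.grad)
```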
