
PyTorch Monarch


We now live in a world where ML workflows (pre-training, post-training, etc.) are heterogeneous, must contend with hardware failures, and are increasingly asynchronous and dynamic. Traditionally, PyTorch has relied on an HPC-style multi-controller model, where multiple copies of the same script are launched across different machines, each running its own instance of the application (often referred to as SPMD). ML workflows are becoming more complex: pre-training might combine advanced parallelism with asynchrony and partial failure, while the RL workflows used in post-training require a high degree of dynamism with complex feedback loops. While the logic of these workflows may be relatively straightforward, they are notoriously difficult to implement well in a multi-controller system, where each node must decide how to act based only on a local view of the workflow’s state.

We believe that the long-term sustainable way to address this is through a single controller programming model, in which a single script orchestrates all distributed resources, making them feel almost local. This architectural shift simplifies distributed programming—your code looks and feels like a single-machine Python program, but can scale across thousands of GPUs. You can directly use Pythonic constructs—classes, functions, loops, tasks, futures—to express complex distributed algorithms.

We’re excited to introduce Monarch, a distributed programming framework that brings the simplicity of single-machine PyTorch to entire clusters.

Monarch lets you program distributed systems the way you’d program a single machine, hiding the complexity of distributed computing:

Program clusters like arrays. Monarch organizes hosts, processes, and actors into scalable meshes that you can manipulate directly. You can operate on entire meshes (or slices of them) with simple APIs; Monarch handles the distribution and vectorization automatically, so you can think in terms of what you want to compute, not where the code runs.

Progressive fault handling. With Monarch, you write your code as if nothing fails. When something does fail, Monarch fails fast by default, stopping the whole program just like an uncaught exception in a simple local script. Later, you can progressively add fine-grained fault handling exactly where you need it, catching and recovering from failures just as you would catch exceptions (see the sketch below).

Separate control from data. Monarch splits the control plane (messaging) from the data plane (RDMA transfers), enabling direct GPU-to-GPU memory transfers across your cluster. Monarch lets you send commands through one path while moving data through another, each optimized for what it does best.

Distributed tensors that feel local. Monarch integrates seamlessly with PyTorch to provide tensors that are sharded across clusters of GPUs. Monarch tensor operations look local but execute across large distributed clusters, with Monarch handling the complexity of coordinating thousands of GPUs.
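To give a feel for the fault-handling model, here is a minimal, self-contained sketch. The remote_step function is only a stand-in for a remote actor call, not part of Monarch's API; the point is that, under a single controller, a remote failure surfaces as an ordinary exception you can catch exactly where recovery makes sense.

```python
# Hypothetical sketch of progressive fault handling; remote_step is a
# stand-in for a Monarch actor-mesh endpoint call, not a real API.
_failed_once: set[int] = set()


def remote_step(step_id: int) -> str:
    # Simulate a transient worker failure the first time step 3 runs.
    if step_id == 3 and step_id not in _failed_once:
        _failed_once.add(step_id)
        raise RuntimeError(f"worker lost during step {step_id}")
    return f"step {step_id} ok"


for step_id in range(5):
    try:
        # Phase 1: write the loop as if nothing fails. An uncaught
        # exception stops the whole program, like a local script.
        print(remote_step(step_id))
    except RuntimeError as err:
        # Phase 2: add recovery only where it is actually needed,
        # e.g. retry the step instead of tearing down the whole job.
        print(f"{err}; retrying step {step_id}")
        print(remote_step(step_id))
```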

Programming Model

Key APIs: Process and Actor Meshes

Monarch organizes resources into multidimensional arrays, or meshes. A process mesh is an array of processes spread across many hosts; an actor mesh is an array of actors, each running inside a separate process. Like array programming in NumPy or PyTorch, meshes make it simple to dispatch operations efficiently across large systems.

At launch, Monarch supports process meshes over GPU clusters—typically one process per GPU—onto which you can spawn actors into actor meshes. For local development, the same meshes can also run on a single machine.
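As a rough sketch of how these pieces fit together, the example below spawns a process mesh on the local host, spawns an actor mesh onto it, and calls an endpoint across the whole mesh. The import path and the this_host, spawn_procs, spawn, and call/get names follow the shape described above, but treat them as assumptions rather than a verbatim copy of Monarch's API.

```python
# Sketch of Monarch's mesh APIs. The names used here (this_host,
# spawn_procs, spawn, endpoint calls via .call(...).get()) are assumed
# from the description above; check the Monarch docs for the exact API.
from monarch.actor import Actor, endpoint, this_host


class Trainer(Actor):
    """One actor per process; at launch, typically one process per GPU."""

    @endpoint
    def step(self, batch_id: int) -> str:
        return f"processed batch {batch_id}"


# A process mesh: here, 8 processes on the local host (one per GPU).
procs = this_host().spawn_procs(per_host={"gpus": 8})

# An actor mesh: one Trainer actor spawned inside each process.
trainers = procs.spawn("trainer", Trainer)

# Operate on the whole mesh at once; the call fans out to every actor
# and returns a future-like handle whose results can be collected.
results = trainers.step.call(0).get()
print(results)
```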

Advanced APIs: Tensor Engine and RDMA Buffer

... continue reading