The latest nightly releases of Mojo (and our next stable release) include initial support for a new accelerator architecture: Apple Silicon GPUs!
We know that one of the biggest barriers to programming GPUs is access to hardware. It’s our hope that by making it possible to use Mojo to develop for a GPU present in every modern Mac, we can further democratize developing GPU-accelerated algorithms and AI models. This should also enable new paths of local-to-cloud development for AI models and more.
To get started, you need an Apple Silicon Mac (all M1 through M4 series chips are supported) running macOS 15 or newer, with Xcode 16 or newer installed. The version of the Metal Shading Language we use (3.2, AIR bitcode version 2.7.0) requires the macOS 15 SDK; if you run on an older macOS, or use an older Xcode that lacks the macOS 15 SDK, you'll get an error about incompatible bitcode versions.
You can clone our modular repository and try out the GPU function examples in the examples/mojo/gpu-functions directory. All but the reduction.mojo example should work on Apple Silicon GPUs today in the latest nightlies. Additionally, puzzles 1-15 of the Mojo GPU puzzles now work on Apple Silicon GPUs with the latest nightly. We haven't yet updated the Pixi environment for the GPU puzzles to add Apple Silicon support, so for now you may need to run the Mojo code manually from another environment.
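To give a taste of what those examples look like, here is a condensed vector-addition kernel in the spirit of the ones in that directory. This is a sketch against the nightly GPU APIs (DeviceContext, enqueue_create_buffer, enqueue_function); exact names and signatures can shift between nightlies, so treat the examples in the repository as the source of truth.

```mojo
from gpu import block_dim, block_idx, thread_idx
from gpu.host import DeviceContext

alias dtype = DType.float32
alias SIZE = 1024
alias BLOCK = 256

fn vector_add(
    result: UnsafePointer[Scalar[dtype]],
    lhs: UnsafePointer[Scalar[dtype]],
    rhs: UnsafePointer[Scalar[dtype]],
):
    # One thread per element, in the usual grid/block pattern.
    var tid = block_dim.x * block_idx.x + thread_idx.x
    if tid < SIZE:
        result[tid] = lhs[tid] + rhs[tid]

def main():
    # On an Apple Silicon Mac, this resolves to the Metal-backed device.
    var ctx = DeviceContext()

    var out_buf = ctx.enqueue_create_buffer[dtype](SIZE)
    var lhs_buf = ctx.enqueue_create_buffer[dtype](SIZE).enqueue_fill(1.25)
    var rhs_buf = ctx.enqueue_create_buffer[dtype](SIZE).enqueue_fill(2.5)

    ctx.enqueue_function[vector_add](
        out_buf.unsafe_ptr(),
        lhs_buf.unsafe_ptr(),
        rhs_buf.unsafe_ptr(),
        grid_dim=SIZE // BLOCK,
        block_dim=BLOCK,
    )
    ctx.synchronize()

    # Map the result back to the host to inspect it;
    # each element should be 3.75 (1.25 + 2.5).
    with out_buf.map_to_host() as host:
        print(host[0], host[SIZE - 1])
```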
Current capabilities
This is just the beginning of our support for Apple Silicon GPUs, and many pieces of functionality still need to be built out. Known features that don’t work today include:
Intrinsics for many hardware capabilities
Not all Mojo GPU examples work, such as reduction.mojo and the more complex matrix multiplication examples
GPU puzzles 16 and above, which need more advanced hardware features
Basic MAX graphs
MAX custom ops
PyTorch interoperability
Running AI models
Serving AI models
I’ll emphasize that even simple MAX graphs, and by extension AI models, don’t yet run on Apple Silicon GPUs. In our Python APIs, accelerator_count() will still return 0 until we have basic MAX graph support enabled. Hopefully, that won’t be long.
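If you want to probe this from the Mojo side instead, the standard library's sys.has_accelerator() is the rough analogue of the Python accelerator_count() check. This is a minimal sketch; whether it reports the Apple GPU on a given nightly is worth verifying yourself, since support is still being built out.

```mojo
from sys import has_accelerator

def main():
    # has_accelerator() is a compile-time query, so it pairs
    # with @parameter if rather than a runtime branch.
    @parameter
    if has_accelerator():
        print("Mojo sees a GPU on this machine.")
    else:
        print("No GPU detected at compile time.")
```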
Next steps
We’ve identified many of the technical blockers to progressively enable the above. The current list of what we plan to work on includes:
Handle MAX_THREADS_PER_BLOCK_METADATA and similar aliases
Support GridDim, lane_id
Enable async_copy_*
Convert arguments of an array type to a pointer type
Support bfloat16 on ARM devices
Support SubBuffer
Enable atomic operations
Complete implementation of MetalDeviceContext::synchronize
Enable captured arguments
Support print and debug_assert
I apologize for some of the cryptic error messages you may get when you hit a piece of missing functionality, or a system configuration we aren't yet compatible with. We hope to improve this messaging over time, and to provide better guides for debugging failures.
How this works
To learn more about how Mojo code is compiled to target Apple Silicon GPUs, check out Amir Nassereldine’s detailed technical presentation from our recent Modular Community Meeting. He did amazing work in establishing the fundamentals during his summer internship, and we are now building on that to advance Mojo on this new architecture.
In brief, compiling and running Mojo code on an Apple Silicon GPU is a multi-step process. First, we compile Mojo GPU functions to Apple Intermediate Representation (AIR) bitcode, by lowering to LLVM IR and then converting that to Metal-compatible AIR.
Mojo handles interactions with an accelerator through the DeviceContext type. In the case of Apple Silicon GPUs, we’ve specialized this into a MetalDeviceContext that handles the next stages in compilation and execution.
The MetalDeviceContext uses the Metal-cpp API to compile the AIR representation into a .metallib for execution on device. Once the .metallib is ready, the MetalDeviceContext manages a Metal CommandQueue and buffers operations for moving data, running a GPU function, and more. All of this happens behind the scenes, so a Mojo developer doesn't need to worry about any of it.
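You can see where that compilation step sits by separating compilation from launch yourself. The sketch below assumes the compile_function / enqueue_function split in the current nightly DeviceContext API; on Apple Silicon, the compile step is where the AIR bitcode becomes a .metallib under the hood.

```mojo
from gpu import thread_idx
from gpu.host import DeviceContext

alias N = 32

fn write_ids(data: UnsafePointer[Float32]):
    # Each thread records its own index.
    data[thread_idx.x] = Float32(thread_idx.x)

def main():
    var ctx = DeviceContext()
    var buf = ctx.enqueue_create_buffer[DType.float32](N)

    # Compile once up front -- on Apple Silicon, this is the point where
    # the MetalDeviceContext turns AIR bitcode into a .metallib.
    var func = ctx.compile_function[write_ids]()

    # The compiled function can then be enqueued repeatedly
    # without paying the compilation cost again.
    ctx.enqueue_function(func, buf.unsafe_ptr(), grid_dim=1, block_dim=N)
    ctx.synchronize()
```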
Code that you’ve written to run on NVIDIA or AMD GPUs should mostly just work on an Apple Silicon GPU, assuming it uses no device-specific features. Of course, different patterns will be required to get the most performance out of each GPU, and we’re excited to explore this new optimization space on Apple Silicon GPUs with you.
Just the beginning
While we’d love help in bringing up Apple Silicon GPU support, some of the infrastructure for introducing support for new AIR intrinsics and compiling them to a .metallib currently requires Modular developers for implementation. We’ll get more of the basics in place before work moves primarily to the open-source standard library and kernels, at which point community members will be able to do a lot more to advance compatibility. Contributions are always welcome, but we don’t want you to hit missing non-public components and get frustrated by being unable to move forward.
We’ll share much more documentation and content on how to work with and optimize for this new hardware family, but we’re extremely excited about even these first few steps onto Apple Silicon GPUs. I’ll try to keep this post up to date as we expand functionality.