Flash-MoE: Running a 397B Parameter Model on a Laptop
Read the paper for full technical details, 90+ experiments, and the story of how an AI and a human built this in 24 hours.
A pure C/Metal inference engine that runs Qwen3.5-397B-A17B (a 397-billion-parameter Mixture-of-Experts model) on a MacBook Pro with 48GB of RAM at 4.4+ tokens/second, with production-quality output including tool calling.
The entire 209GB model streams from SSD through a custom Metal compute pipeline. No Python. No frameworks. Just C, Objective-C, and hand-tuned Metal shaders.
Results
| Configuration | tok/s | Quality | Notes |
|---|---|---|---|
| 4-bit experts, FMA kernel | 4.36 | Excellent | Current best. Full tool calling. 209GB on disk. |
| 4-bit experts, baseline | 3.90 | Excellent | Before FMA kernel optimization. |
| 2-bit experts, trust OS | 5.74 | Good* | 120GB on disk. *Breaks JSON/tool calling. |
| 2-bit peak single token | 7.05 | Good* | Warm cache burst. *Not suitable for tool use. |
*2-bit quantization produces malformed escapes like `ame\` instead of `"name"` in JSON output, making tool calling unreliable. 4-bit is the production configuration.
Hardware
Machine: MacBook Pro, Apple M3 Max