
Flash-MoE: Running a 397B Parameter Model on a Laptop

Why This Matters

Flash-MoE demonstrates that a 397-billion-parameter model can run efficiently on a consumer-grade MacBook Pro by combining custom low-level programming with SSD streaming. This could democratize access to large AI models, reducing reliance on expensive cloud infrastructure and enabling wider experimentation and deployment. It is a significant step toward portable, high-performance AI for both developers and consumers.

Key Takeaways


Read the paper — Full technical details, 90+ experiments, and the story of how an AI and a human built this in 24 hours.

Flash-MoE is a pure C/Metal inference engine that runs Qwen3.5-397B-A17B (a 397-billion-parameter Mixture-of-Experts model) on a MacBook Pro with 48GB RAM at 4.4+ tokens/second, with production-quality output including tool calling.

The entire 209GB model streams from SSD through a custom Metal compute pipeline. No Python. No frameworks. Just C, Objective-C, and hand-tuned Metal shaders.

Results

| Configuration | tok/s | Quality | Notes |
|---|---|---|---|
| 4-bit experts, FMA kernel | 4.36 | Excellent | Current best. Full tool calling. 209GB on disk. |
| 4-bit experts, baseline | 3.90 | Excellent | Before FMA kernel optimization. |
| 2-bit experts, trust OS | 5.74 | Good* | 120GB on disk. *Breaks JSON/tool calling. |
| 2-bit peak single token | 7.05 | Good* | Warm cache burst. *Not suitable for tool use. |

*2-bit quantization produces ame\ instead of "name" in JSON output, making tool calling unreliable. 4-bit is the production configuration.

Hardware

Machine: MacBook Pro, Apple M3 Max
