
ZAYA1-8B: An 8B MoE Model with 760M Active Params Matching DeepSeek-R1 on Math

Why This Matters

ZAYA1-8B demonstrates that high-performance AI models can be built and trained on AMD hardware, challenging NVIDIA's dominance in the industry. Its competitive benchmarks in math, reasoning, and coding highlight the potential for more diverse and cost-effective AI infrastructure options. This development signals a shift towards hardware independence and broader accessibility in AI research and deployment.


Zyphra just dropped a model that’s doing something most people will scroll past without understanding why it’s interesting.

ZAYA1-8B matches DeepSeek-R1 on math benchmarks. Stays competitive with Claude Sonnet 4.5 on reasoning. Closes in on Gemini 2.5 Pro on coding. These are frontier model comparisons, the kind of numbers that usually come with billions of parameters and serious hardware requirements.

This one runs on just 760 million active parameters out of 8 billion total. And it was trained entirely on AMD hardware, which almost no serious model can say.
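The gap between total and active parameters is just mixture-of-experts arithmetic: each token is routed to only a few experts per layer, so most of the weights sit idle on any given forward pass. Here is a minimal sketch of that accounting; the layer counts, expert sizes, and top-k value below are illustrative placeholders, not ZAYA1-8B's published configuration.

```python
# Illustrative MoE parameter accounting. All configuration numbers are
# made up for demonstration; they are NOT ZAYA1-8B's actual architecture.

def moe_param_counts(
    n_layers: int,
    d_model: int,
    d_ff: int,
    n_experts: int,
    top_k: int,
    attn_params_per_layer: int,
    embed_params: int,
) -> tuple[int, int]:
    """Return (total, active-per-token) parameter counts for a top-k MoE."""
    # Each expert is a feed-forward block: up-projection + down-projection.
    expert_params = 2 * d_model * d_ff
    # All experts are stored, but only top_k of them run per token.
    per_layer_total = attn_params_per_layer + n_experts * expert_params
    per_layer_active = attn_params_per_layer + top_k * expert_params
    total = embed_params + n_layers * per_layer_total
    active = embed_params + n_layers * per_layer_active
    return total, active

# Hypothetical config chosen so the totals land in the same ballpark
# as "~8B total, under 1B active".
total, active = moe_param_counts(
    n_layers=24, d_model=2048, d_ff=2048,
    n_experts=32, top_k=1,
    attn_params_per_layer=4 * 2048 * 2048,  # Q, K, V, O projections
    embed_params=50_000 * 2048,             # tied input/output embedding
)
print(f"total ≈ {total / 1e9:.1f}B, active ≈ {active / 1e9:.2f}B per token")
```

With these made-up numbers the model stores roughly 7B parameters but touches only about 0.7B per token, which is the same shape of trade-off the headline figures describe: frontier-adjacent capacity at a fraction of the per-token compute.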

Built on AMD Instead of NVIDIA

Every model you’ve heard of was trained on NVIDIA hardware, mostly H100s, A100s, and GB200s. The entire open-source AI ecosystem has been built on a de facto NVIDIA monopoly, and most labs don’t even mention the hardware because there’s nothing to mention; it’s always NVIDIA.

Zyphra trained ZAYA1-8B end to end on AMD Instinct MI300X GPUs. Pretraining, midtraining, supervised fine-tuning, all of it on a 1,024-node AMD cluster built with IBM, using AMD Pensando Pollara interconnects.

That detail matters for two reasons. First, it proves the AMD stack can produce frontier-competitive results at this scale, which matters for anyone thinking about infrastructure that isn’t locked into NVIDIA pricing.

Second, it means Zyphra had to solve real engineering problems that most labs never encounter because they default to CUDA. The fact that the model performs this well coming off that stack says something about both the hardware and the team. It’s a proof of concept for an alternative path the industry needs.
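One reason a port like this is feasible at all: PyTorch’s ROCm build exposes AMD GPUs through the familiar `torch.cuda` namespace, so most device-selection code runs unchanged on an MI300X. The sketch below shows that mechanism in general terms; it assumes a ROCm build of PyTorch and is not Zyphra’s training code.

```python
import torch

# On a ROCm build of PyTorch, HIP devices appear through the torch.cuda
# API, so the same code path serves AMD Instinct and NVIDIA GPUs.
# (General sketch only; this is not Zyphra's training stack.)
if torch.cuda.is_available():
    device = torch.device("cuda")
    print("GPU:", torch.cuda.get_device_name(0))
    # torch.version.hip is a version string on ROCm builds, None on CUDA builds.
    print("ROCm build:", torch.version.hip is not None)
else:
    device = torch.device("cpu")

x = torch.randn(4096, 4096, device=device)
y = x @ x  # dispatched to rocBLAS on AMD, cuBLAS on NVIDIA
print(y.shape)
```

The API compatibility covers the easy part; the harder engineering the article alludes to lives below this layer, in kernels, collectives, and interconnect tuning.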

Less Than 1B Active Parameters

... continue reading