Tech News

MiniMax-M1 open-weight, large-scale hybrid-attention reasoning model


1. Model Overview

We introduce MiniMax-M1, the world's first open-weight, large-scale hybrid-attention reasoning model. MiniMax-M1 is powered by a hybrid Mixture-of-Experts (MoE) architecture combined with a lightning attention mechanism. The model is developed from our previous MiniMax-Text-01 model, which contains 456 billion total parameters with 45.9 billion parameters activated per token. Consistent with MiniMax-Text-01, the M1 model natively supports a context length of 1 million tokens, 8x the context size of DeepSeek R1. Furthermore, the lightning attention mechanism in MiniMax-M1 enables efficient scaling of test-time compute; for example, compared to DeepSeek R1, M1 consumes 25% of the FLOPs at a generation length of 100K tokens. These properties make M1 particularly suitable for complex tasks that require processing long inputs and thinking extensively.

MiniMax-M1 is trained using large-scale reinforcement learning (RL) on diverse problems ranging from traditional mathematical reasoning to sandbox-based, real-world software engineering environments. We develop an efficient RL scaling framework for M1 highlighting two perspectives: (1) we propose CISPO, a novel algorithm that clips importance sampling weights instead of token updates, which outperforms other competitive RL variants; (2) our hybrid-attention design naturally enhances the efficiency of RL, and we address unique challenges when scaling RL with the hybrid architecture.

We train two versions of MiniMax-M1 with 40K and 80K thinking budgets respectively. Experiments on standard benchmarks show that our models outperform other strong open-weight models such as the original DeepSeek-R1 and Qwen3-235B, particularly on complex software engineering, tool use, and long-context tasks. With efficient scaling of test-time compute, MiniMax-M1 serves as a strong foundation for next-generation language model agents to reason and tackle real-world challenges.
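To make the CISPO idea concrete, here is a minimal single-token sketch of the contrast with a PPO-style clipped objective (an illustration only; the function names, clipping bounds, and exact formulation are assumptions, not the paper's definition). PPO clips the per-token update, so tokens whose importance ratio leaves the trust region contribute no gradient; a CISPO-style objective instead clips the importance sampling weight itself, treats it as a constant, and keeps a gradient flowing through every token's log-probability.

```python
import numpy as np

def ppo_token_objective(ratio, adv, eps=0.2):
    # PPO-clip: the token's update is clipped, so tokens with ratio
    # outside [1 - eps, 1 + eps] (in the unfavorable direction)
    # receive zero gradient.
    return np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv)

def cispo_token_objective(ratio, adv, logprob, eps_low=0.2, eps_high=0.2):
    # CISPO-style sketch: clip the importance sampling WEIGHT and treat
    # it as a constant (a stop-gradient in a real autograd framework),
    # then weight the policy-gradient term adv * log pi. Every token
    # keeps a nonzero gradient through `logprob`.
    w = np.clip(ratio, 1 - eps_low, 1 + eps_high)  # detached weight
    return w * adv * logprob
```

In an actual training loop both objectives are averaged over tokens and maximized; the sketch only shows where the clipping is applied.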

Benchmark performance comparison of leading commercial and open-weight models across competition-level mathematics, coding, software engineering, agentic tool use, and long-context understanding tasks. Results shown for MiniMax-M1 use the MiniMax-M1-80k model.

2. Evaluation

Performance of MiniMax-M1 on core benchmarks.

| Category | Task | MiniMax-M1-80K | MiniMax-M1-40K | Qwen3-235B-A22B | DeepSeek-R1-0528 | DeepSeek-R1 | Seed-Thinking-v1.5 | Claude 4 Opus | Gemini 2.5 Pro (06-05) | OpenAI-o3 |
|---|---|---|---|---|---|---|---|---|---|---|
| | Extended Thinking | 80K | 40K | 32k | 64k | 32k | 32k | 64k | 64k | 100k |
| Mathematics | AIME 2024 | 86.0 | 83.3 | 85.7 | 91.4 | 79.8 | 86.7 | 76.0 | 92.0 | 91.6 |
| | AIME 2025 | 76.9 | 74.6 | 81.5 | 87.5 | 70.0 | 74.0 | 75.5 | 88.0 | 88.9 |
| | MATH-500 | 96.8 | 96.0 | 96.2 | 98.0 | 97.3 | 96.7 | 98.2 | 98.8 | 98.1 |
| General Coding | LiveCodeBench (24/8~25/5) | 65.0 | 62.3 | 65.9 | 73.1 | 55.9 | 67.5 | 56.6 | 77.1 | 75.8 |
| | FullStackBench | 68.3 | 67.6 | 62.9 | 69.4 | 70.1 | 69.9 | 70.3 | -- | 69.3 |
| Reasoning & Knowledge | GPQA Diamond | 70.0 | 69.2 | 71.1 | 81.0 | 71.5 | 77.3 | 79.6 | 86.4 | 83.3 |
| | HLE (no tools) | 8.4* | 7.2* | 7.6* | 17.7* | 8.6* | 8.2 | 10.7 | 21.6 | 20.3 |
| | ZebraLogic | 86.8 | 80.1 | 80.3 | 95.1 | 78.7 | 84.4 | 95.1 | 91.6 | 95.8 |
| | MMLU-Pro | 81.1 | 80.6 | 83.0 | 85.0 | 84.0 | 87.0 | 85.0 | 86.0 | 85.0 |
| Software Engineering | SWE-bench Verified | 56.0 | 55.6 | 34.4 | 57.6 | 49.2 | 47.0 | 72.5 | 67.2 | 69.1 |
| Long Context | OpenAI-MRCR (128k) | 73.4 | 76.1 | 27.7 | 51.5 | 35.8 | 54.3 | 48.9 | 76.8 | 56.5 |
| | OpenAI-MRCR (1M) | 56.2 | 58.6 | -- | -- | -- | -- | -- | 58.8 | -- |
| | LongBench-v2 | 61.5 | 61.0 | 50.1 | 52.1 | 58.3 | 52.5 | 55.6 | 65.0 | 58.8 |
| Agentic Tool Use | TAU-bench (airline) | 62.0 | 60.0 | 34.7 | 53.5 | -- | 44.0 | 59.6 | 50.0 | 52.0 |
| | TAU-bench (retail) | 63.5 | 67.8 | 58.6 | 63.9 | -- | 55.7 | 81.4 | 67.0 | 73.9 |
| Factuality | SimpleQA | 18.5 | 17.9 | 11.0 | 27.8 | 30.1 | 12.9 | -- | 54.0 | 49.4 |
| General Assistant | MultiChallenge | 44.7 | 44.7 | 40.0 | 45.0 | 40.7 | 43.0 | 45.8 | 51.8 | 56.5 |

* conducted on the text-only HLE subset.

Our models are evaluated with temperature=1.0 and top_p=0.95.
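As a refresher on what those two decoding knobs do, here is a generic temperature plus nucleus (top-p) filtering sketch (an illustration of the standard technique, not MiniMax's evaluation harness; the function name is hypothetical):

```python
import numpy as np

def sample_filter(logits, temperature=1.0, top_p=0.95):
    """Apply temperature scaling, then nucleus (top-p) filtering.

    Returns the renormalized distribution actually sampled from:
    only the smallest set of highest-probability tokens whose
    cumulative mass reaches `top_p` keeps nonzero probability.
    """
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))  # stable softmax
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]          # most likely first
    csum = np.cumsum(probs[order])
    cutoff = np.searchsorted(csum, top_p) + 1  # smallest prefix >= top_p
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()
```

With temperature=1.0 the model's distribution is used as-is, and top_p=0.95 trims only the low-probability tail, which keeps sampling diverse while avoiding very unlikely tokens.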

SWE-bench methodology
