MiniMax handed an internal version of M2.7 a programming scaffold and let it run unsupervised. Over 100 rounds it analyzed its own failures, modified its own code, ran evaluations, and decided what to keep and what to revert. The result was a 30% performance improvement with nobody directing each step. That is not a benchmark result. That is a different way of thinking about how AI models get built.
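The loop described above can be sketched in a few lines. This is a minimal, hypothetical illustration of an evolve-evaluate-revert cycle, not MiniMax's actual harness (which has not been published); `evaluate` and `propose_patch` are stand-ins for the real scoring and self-modification steps.

```python
# Hedged sketch of an unsupervised self-improvement loop: propose a change,
# evaluate it, keep it if it helps, revert it otherwise.
# All function names here are hypothetical stand-ins.

def evaluate(code: str) -> float:
    """Stand-in scoring function; the real harness would run evaluations."""
    return float(len(code))

def propose_patch(code: str, round_no: int) -> str:
    """Stand-in mutation; the real model would rewrite its own code."""
    if round_no % 2:
        return code + f"\n# improvement attempt {round_no}"
    return code[: len(code) // 2]

def self_evolve(code: str, rounds: int = 100) -> str:
    best_score = evaluate(code)
    for r in range(rounds):
        candidate = propose_patch(code, r)
        score = evaluate(candidate)
        if score > best_score:
            code, best_score = candidate, score  # keep the change
        # otherwise revert: the candidate is simply discarded
    return code
```

The key design point is that the model itself decides what to keep: a failed patch costs nothing because the previous state is always retained as the fallback.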
M2.7 is now available on HuggingFace with weights you can download and deploy. NVIDIA is offering free API access if you want to try it without the hardware overhead. The license has a commercial limitation worth knowing about; we will get to that.
What self-evolution actually means here
MiniMax used M2.7 during its own development to update memory, build skills for reinforcement learning experiments, and improve its own learning process based on experiment results. The model was a participant in its own training pipeline.
The clearest demonstration is the MLE Bench Lite result. MiniMax gave M2.7 access to 22 machine learning competitions, each runnable on a single A30 GPU, and let it run three 24-hour trials with a simple harness built around short-term memory, self-feedback, and self-optimization. After each round the model generated a memory file, criticized its own results, and fed those observations into the next round.
The best run achieved 9 gold medals, 5 silver medals, and 1 bronze across those 22 competitions. The average medal rate across all three trials was 66.6%, trailing only Opus 4.6 (75.7%) and GPT-5.4 (71.2%).
What makes this interesting is not the medal count. It is that the improvement was continuous across all three 24-hour windows. The model kept finding better approaches the longer it ran, which connects directly to the long-horizon behavior that makes agentic models actually useful in production.
What M2.7 can do
The benchmark that matters most for developers is SWE-Pro, which tests real software engineering across multiple programming languages. M2.7 scores 56.22%, matching GPT-5.3-Codex. On SWE Multilingual it scores 76.5% and on Multi SWE Bench 52.7%, both of which test closer to real-world engineering scenarios.