Build for developers, not benchmarks
Coding models are most useful when they perform well in the same environment developers use every day. That is why we built MAI-Code-1-Flash with production workflows at the center, rather than optimizing only for benchmarks. The model was trained directly with GitHub Copilot harnesses used in production. This allows it to learn how to interact with surrounding tools and systems in agentic coding tasks, making it uniquely well suited to real-world Copilot workflows compared to other available models.
During training, we evaluated checkpoints across core software engineering tasks, repository question answering, refactoring, and telemetry-grounded tasks adapted from real GitHub Copilot usage. This alignment between training, evaluation, and production helps offline improvements translate into real-world developer quality.
Designed to maximize value per token
MAI-Code-1-Flash was trained with adaptive solution length control, which helps the model adjust the depth of its response to the task. It can stay concise for simpler requests and spend more reasoning budget when a problem requires deeper analysis or broader code changes. In practice, this means developers start seeing useful output sooner. We see MAI-Code-1-Flash solving harder problems with up to 60% fewer tokens. This helps reduce latency, lower cost, improve return on token, and make interactive workflows feel smoother.
Benchmark results in the production harness
To understand both quality and efficiency, we evaluated MAI-Code-1-Flash against Claude Haiku 4.5 on SWE-Bench Verified, SWE-Bench Pro, SWE-Bench Multilingual, and Terminal Bench 2 using the same production harness that developers use for their everyday coding tasks. We measured task success and the average number of solution tokens required to complete each task.
MAI-Code-1-Flash outperforms Claude Haiku 4.5 across all core coding benchmarks tested, with higher pass rates on all 4 evaluations, including a +16-point lead on the diverse, real-world tasks of SWE-Bench Pro (51.2% vs. 35.2%). It’s not just smarter; it’s leaner, solving harder problems with up to 60% fewer tokens on SWE-Bench Verified, proving that higher accuracy and greater efficiency are no longer a trade-off.