
Ornith-1.0: self-improving open-source models for agentic coding

Why This Matters

Ornith-1.0 is a family of self-improving open-source models for agentic coding that reports state-of-the-art performance among open-source models of comparable size on several coding benchmarks. Its reinforcement learning framework jointly optimizes the scaffold that drives each rollout and the resulting solution, which is intended to yield better search trajectories and higher-quality solutions. For developers and organizations, this means more capable, accessible, and adaptable open-source coding models.

Key Takeaways

Aloha! 🌺 Ornith-1.0 is a family of self-improving open-source models for agentic coding.

Highlights:

State-of-the-Art Coding Agents: Available in 9B-Dense, 31B-Dense, 35B-MoE, and 397B-MoE (post-trained on top of Gemma 4 and Qwen 3.5), achieving state-of-the-art performance among open-source models of comparable size on coding benchmarks such as Terminal-Bench 2.1, SWE-Bench, NL2Repo and OpenClaw.

Self-Improving Training Framework: Ornith-1.0 employs RL to learn to generate not only solution rollouts, but also the scaffold that drives those rollouts. By jointly optimizing the scaffold and the resulting solution, the model discovers better search trajectories and generates higher-quality solutions.

License: MIT licensed, globally accessible, and free from regional limitations.
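To make the idea of jointly optimizing a scaffold and a solution concrete, here is a toy sketch. It is not Ornith-1.0's training code: the two-scaffold, two-solution reward table, the names `REWARD`, `train`, and `expected_return`, and the use of exact expected policy gradients (instead of sampled rollouts) are all assumptions made for illustration.

```python
import math

# Hypothetical reward table: REWARD[s][a] is the payoff of solution a
# when it is produced under scaffold s. The best solution depends on the scaffold.
REWARD = [[0.2, 0.6],
          [0.9, 0.1]]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_return(theta, phi):
    """Expected reward under scaffold policy theta and per-scaffold solution policies phi."""
    p_s = softmax(theta)
    return sum(
        p_s[s] * sum(p * r for p, r in zip(softmax(phi[s]), REWARD[s]))
        for s in range(len(REWARD))
    )

def train(steps=300, lr=1.0):
    """Gradient ascent on the expected return, updating both policies from the same reward."""
    theta = [0.0] * len(REWARD)                   # logits over scaffolds
    phi = [[0.0] * len(row) for row in REWARD]    # logits over solutions, per scaffold
    for _ in range(steps):
        p_s = softmax(theta)
        p_a = [softmax(row) for row in phi]
        # Value of each scaffold under the current solution policy, and the overall return.
        v = [sum(p * r for p, r in zip(p_a[s], REWARD[s])) for s in range(len(REWARD))]
        j = sum(ps * vs for ps, vs in zip(p_s, v))
        for s in range(len(REWARD)):
            # Scaffold logit moves up when its value beats the average return.
            theta[s] += lr * p_s[s] * (v[s] - j)
            for a in range(len(REWARD[s])):
                # Solution logit moves up when its reward beats that scaffold's value.
                phi[s][a] += lr * p_s[s] * p_a[s][a] * (REWARD[s][a] - v[s])
    return theta, phi

if __name__ == "__main__":
    theta, phi = train()
    print("scaffold probabilities:", softmax(theta))
    print("expected return:", expected_return(theta, phi))
```

In this toy setting, both sets of logits are updated from the same reward signal, so the scaffold policy and the solution policy improve together rather than one being held fixed.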

Benchmarks

Each model is evaluated against its size-appropriate baselines. All three use the same harnesses and decoding setup (see the notes under the tables).

| Agentic Coding | Ornith-1.0-9B | Qwen3.5-9B | Qwen3.5-35B | Gemma4-12B | Gemma4-31B |
|---|---|---|---|---|---|
| Terminal-Bench 2.1 (Terminus-2) | 43.1 | 21.3 | 41.4 | 21 | 42.1 |
| Terminal-Bench 2.1 (Claude Code) | 40.6 | 18.9 | 38.9 | - | - |
| SWE-bench Verified | 69.4 | 53.2 | 70 | 44.2 | 52 |
| SWE-bench Pro | 42.9 | 31.3 | 44.6 | 27.6 | 35.7 |
| SWE-bench Multilingual | 52 | 39.7 | 60.3 | 32.5 | 51.7 |
| NL2Repo | 27.2 | 16.2 | 20.5 | 10.3 | 15.5 |
| Claw-eval Avg | 63.1 | 53.2 | 65.4 | 32.5 | 48.5 |
| SWE Atlas - QnA | 17.9 | 9.2 | 13.2 | - | - |
| SWE Atlas - RF | 16.6 | 4.3 | 10.2 | - | - |
| SWE Atlas - TW | 15.3 | 4.4 | 9.8 | - | - |

| Agentic Coding | Ornith-1.0-35B | Qwen3.5-35B | Qwen3.6-35B | Gemma4-31B | Qwen3.5-397B |
|---|---|---|---|---|---|
| Terminal-Bench 2.1 (Terminus-2) | 64.2 | 41.4 | 52.5 | 42.1 | 53.5 |
| Terminal-Bench 2.1 (Claude Code) | 62.8 | 38.9 | 49.2 | - | 48.6 |
| SWE-bench Verified | 75.6 | 70 | 73.4 | 52 | 76.4 |
| SWE-bench Pro | 50.4 | 44.6 | 49.5 | 35.7 | 51.6 |
| SWE-bench Multilingual | 69.3 | 60.3 | 67.2 | 51.7 | 69.3 |
| NL2Repo | 34.6 | 20.5 | 29.4 | 15.5 | 36.8 |
| Claw-eval Avg | 69.8 | 65.4 | 68.7 | 48.5 | 70.7 |
| SWE Atlas - QnA | 37.1 | 13.2 | 15.5 | - | 20.4 |
| SWE Atlas - RF | 29.7 | 10.2 | 11.4 | - | 18.4 |
| SWE Atlas - TW | 27.8 | 9.8 | 13.3 | - | 18.5 |

| Agentic Coding | Ornith-1.0-397B | Qwen3.5-397B | Qwen3.7-Max | GLM-5.2-744B | Minimax-M3-428B | DeepSeek-V4-Pro-1.6T | Claude Opus 4.7 | Claude Opus 4.8 |
|---|---|---|---|---|---|---|---|---|
| Terminal-Bench 2.1 (Terminus-2) | 77.5 | 53.5 | 73.5 | 81.0 | 64 | 64 | 70.3 | 85 |
| Terminal-Bench 2.1 (Claude Code) | 78.2 | 48.6 | 69.8 | 82.7 | - | 66.5 | 69.7 | 78.9 |
| SWE-bench Verified | 82.4 | 76.4 | 80.4 | - | - | 80.6 | 80.8 | 87.6 |
| SWE-bench Pro | 62.2 | 51.6 | 60.6 | 62.1 | 59 | 55.4 | 64.3 | 69.2 |
| SWE-bench Multilingual | 78.9 | 69.3 | 78.3 | - | - | 76.2 | - | - |
| NL2Repo | 48.2 | 36.8 | 47.2 | 48.9 | 42.1 | - | - | 69.7 |
| Claw-eval Avg | 77.1 | 70.7 | 65.2 | - | - | 75.8 | 78.2 | - |
| SWE Atlas - QnA | 41.2 | 20.4 | - | - | 37.9 | 27.2 | 40.3 | 48.8 |
| SWE Atlas - RF | 42.6 | 18.4 | - | - | - | - | 48.6 | 46.7 |
| SWE Atlas - TW | 39.1 | 18.5 | - | - | 30.8 | - | 38.5 | - |
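As a reading aid for the three tables, the short script below computes the point difference between each Ornith-1.0 model and the Qwen3.5 column of the same parameter count, on three of the listed benchmarks. The scores are copied from the tables; the names `SCORES` and `score_deltas` are illustrative, and pairing each model with its same-size Qwen3.5 column is a choice made for this sketch, not a statement about which base model each checkpoint was trained from.

```python
# Scores copied from the benchmark tables: (Ornith-1.0 score, same-size Qwen3.5 score).
SCORES = {
    "9B": {
        "Terminal-Bench 2.1 (Terminus-2)": (43.1, 21.3),
        "SWE-bench Verified": (69.4, 53.2),
        "NL2Repo": (27.2, 16.2),
    },
    "35B": {
        "Terminal-Bench 2.1 (Terminus-2)": (64.2, 41.4),
        "SWE-bench Verified": (75.6, 70.0),
        "NL2Repo": (34.6, 20.5),
    },
    "397B": {
        "Terminal-Bench 2.1 (Terminus-2)": (77.5, 53.5),
        "SWE-bench Verified": (82.4, 76.4),
        "NL2Repo": (48.2, 36.8),
    },
}

def score_deltas(scores):
    """Return the Ornith-minus-Qwen3.5 difference in points, rounded to one decimal."""
    return {
        size: {bench: round(ornith - qwen, 1) for bench, (ornith, qwen) in rows.items()}
        for size, rows in scores.items()
    }

if __name__ == "__main__":
    for size, rows in score_deltas(SCORES).items():
        for bench, delta in rows.items():
            print(f"{size:>5}  {bench:<34} {delta:+.1f}")
```

On these rows, the gap on Terminal-Bench 2.1 (Terminus-2) is larger than the gap on SWE-bench Verified at every size; for the 9B pair the differences are 21.8 and 16.2 points respectively.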
