Ornith-1.0: Self-scaffolding LLMs for agentic coding

The edge-deployable Ornith-1.0-9B also delivers remarkably strong results, achieving 43.1 on Terminal-Bench 2.1 and 69.4 on SWE-Bench Verified. Despite being a compact 9B-parameter model, it matches or exceeds the performance of much larger models such as Gemma 4-31B, demonstrating that strong agentic coding capabilities can be achieved even in resource-efficient deployments.

Ornith-1.0-35B significantly outperforms similarly sized models, including Qwen 3.5-35B, Qwen 3.6-35B, and Gemma 4-31B. Despite having only 35B parameters, it even surpasses Qwen 3.5-397B on Terminal-Bench 2.1 (64.4 vs. 53.5) while matching its performance across several other coding and agentic benchmarks.

At the flagship scale, Ornith-1.0-397B achieves 77.5 on Terminal-Bench 2.1 and 82.4 on SWE-Bench Verified, surpassing Claude Opus 4.7 on both benchmarks and outperforming leading open-source models of similar size, including MiniMax M3 and DeepSeek-V4-Pro.

Ornith-1.0 achieves state-of-the-art performance among open-source models of comparable size across a broad range of agentic coding benchmarks. Ornith-1.0-397B (77.5 on Terminal-Bench 2.1 and 82.4 on SWE-Bench Verified) surpasses Claude Opus 4.7 (70.3 on TB-2.1 and 80.8 on SWE-Bench Verified) and outperforms leading open-source models of similar size, including MiniMax M3 (66.0 on TB-2.1 and 80.5 on SWE-Bench Verified) and DeepSeek-V4-Pro (67.9 on TB-2.1 and 80.6 on SWE-Bench Verified). Ornith-1.0-9B, which can be easily deployed on edge devices, matches or exceeds the performance of much larger models such as Gemma 4-31B and Qwen 3.6-35B.

The key innovation behind Ornith-1.0 is a self-improving training framework. Instead of relying on human-designed harnesses to drive solution generation in RL, Ornith-1.0 learns to generate both solution rollouts and the task-specific harnesses that guide those rollouts. By jointly optimizing the scaffold and the resulting solution, the model can discover better search trajectories and generate higher-quality solutions.

Today, we are introducing Ornith-1.0, a self-improving family of open-source models built specifically for agentic coding tasks. Ornith-1.0 spans the full spectrum, from compact 9B dense models suited to edge deployment to frontier-scale 397B MoE models optimized for maximum performance, with variants including 9B Dense, 31B Dense, 35B MoE, and 397B MoE. Built on top of pretrained Gemma 4 and Qwen 3.5, it achieves state-of-the-art performance among open-source models of comparable size on coding benchmarks.

A Self-improving Strategy for LLM Training

At the core of Ornith-1.0 is a self-improving training framework that jointly learns to solve tasks and to construct the scaffolds that guide those solutions. Rather than relying on a fixed, human-designed harness shared across a task category, Ornith-1.0 treats the scaffold as a learnable object that co-evolves with the policy.

Each RL step proceeds in two stages: conditioned on a task and the scaffold previously used for it, the model first proposes a refined scaffold; conditioned on that scaffold and the task description, it then generates a solution rollout. Reward from the rollout is propagated to both stages, so the model is optimized not only to produce better answers but to author the orchestration that elicits them.
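The two-stage step described above can be illustrated with a minimal sketch. This is a toy under stated assumptions, not the actual training code: ToyPolicy, its two methods, toy_reward, and the scaffold_store dictionary are hypothetical names introduced only for illustration, and the reward rule is invented. The only parts taken from the text are the ordering (scaffold first, then rollout) and the fact that the same rollout reward is credited to both stages.

```python
import random
from dataclasses import dataclass


@dataclass
class StepRecord:
    task: str
    old_scaffold: str
    new_scaffold: str
    rollout: str
    reward: float


class ToyPolicy:
    """Stand-in for the model; both methods are hypothetical placeholders."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def propose_scaffold(self, task, prev_scaffold):
        # Stage 1: refine the scaffold previously used for this task.
        hint = self.rng.choice(
            ["run tests first", "read the README", "plan then edit"]
        )
        return f"{prev_scaffold}; {hint}" if prev_scaffold else hint

    def generate_rollout(self, task, scaffold):
        # Stage 2: produce a solution conditioned on the task and the scaffold.
        return f"solution for {task!r} following [{scaffold}]"


def toy_reward(rollout):
    # Invented verifier (assumption): reward rollouts guided by "run tests first".
    return 1.0 if "run tests first" in rollout else 0.0


def rl_step(policy, task, scaffold_store):
    """Run one two-stage step and return the record plus per-stage rewards."""
    prev = scaffold_store.get(task, "")
    new_scaffold = policy.propose_scaffold(task, prev)
    rollout = policy.generate_rollout(task, new_scaffold)
    reward = toy_reward(rollout)
    # The same scalar reward is credited to both the scaffold and the rollout.
    stage_rewards = {"scaffold": reward, "rollout": reward}
    # Assumption: the scaffold just used becomes the "previous" one next time.
    scaffold_store[task] = new_scaffold
    record = StepRecord(task, prev, new_scaffold, rollout, reward)
    return record, stage_rewards


if __name__ == "__main__":
    store = {}
    record, rewards = rl_step(ToyPolicy(seed=0), "fix failing test", store)
    print(record)
    print(rewards)
```

In a real system the per-stage rewards would feed a policy-gradient update over the tokens of each stage; that update is omitted here because the text does not specify it.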

Repeated over training, this yields a feedback loop in which scaffolds are continually mutated and selected toward those that induce higher-reward trajectories, allowing per-task-category strategies to emerge automatically and driving sustained capability gains without hand-engineered harness design.
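The mutate-and-select dynamic can be pictured with a simple hill-climbing analogy. This sketch is not the RL procedure itself: it swaps gradient-based policy updates for an explicit accept-if-not-worse rule, purely to show how repeated scaffold mutation under a reward signal drifts toward higher-reward scaffolds. HINTS, mutate, toy_reward, and evolve are hypothetical names, and the reward (fraction of distinct hints present) is invented for the example.

```python
import random

# Hypothetical pool of scaffold instructions (assumption, for illustration).
HINTS = [
    "run tests first",
    "read the README",
    "plan then edit",
    "bisect the failure",
]


def mutate(scaffold, rng):
    """Return a copy of the scaffold with one hint appended or swapped in."""
    new = list(scaffold)
    hint = rng.choice(HINTS)
    if new and rng.random() < 0.5:
        new[rng.randrange(len(new))] = hint
    else:
        new.append(hint)
    return tuple(new)


def toy_reward(scaffold):
    # Invented reward (assumption): fraction of distinct hints the scaffold uses.
    return len(set(scaffold)) / len(HINTS)


def evolve(steps, seed=0):
    """Hill-climb over scaffolds, keeping a mutation only if reward does not drop."""
    rng = random.Random(seed)
    best = ()
    best_reward = toy_reward(best)
    history = [best_reward]
    for _ in range(steps):
        candidate = mutate(best, rng)
        reward = toy_reward(candidate)
        if reward >= best_reward:
            best, best_reward = candidate, reward
        history.append(best_reward)
    return best, history


if __name__ == "__main__":
    scaffold, history = evolve(steps=30)
    print(scaffold)
    print(history[0], history[-1])
```

Running one such loop per task category, each with its own reward, would be one way to picture how category-specific scaffolds could diverge over training.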
