Agentic Coding
Foundation models are shifting from answering questions to taking action, and in the digital world that action takes the form of code. Coding is the substrate of digital agency, the purest form of the plan–execute–observe–iterate loop, and the leading indicator of where a model's broader agentic capability is heading. We invested heavily in this surface for Step 3.7 Flash. Compared to Step 3.5 Flash, it gains +5% on SWE-Bench Pro and 6.1% on Terminal-Bench 2.1.
Step-SWE-Bench Step 3.5 Flash · Hermes Agent: 60.00% Step 3.5 Flash · OpenClaw: 47.00% Step 3.5 Flash · Claude Code: 73.00% Step 3.5 Flash · KiloCode: 59.00% Step 3.5 Flash · OpenCode: 57.00% Step 3.5 Flash · RooCode: 43.00% Step 3.7 Flash · Hermes Agent: 67.50% Step 3.7 Flash · OpenClaw: 67.00% Step 3.7 Flash · Claude Code: 71.50% Step 3.7 Flash · KiloCode: 67.50% Step 3.7 Flash · OpenCode: 64.50% Step 3.7 Flash · RooCode: 64.50% 75 60 45 30 Hermes Agent OpenClaw Claude Code KiloCode OpenCode RooCode Step 3.7 Flash avg 67.08% Step 3.5 Flash avg 56.50% Step 3.7 Flash Step 3.5 Flash Hermes Agent 67.50% 60.00% OpenClaw 67.00% 47.00% Claude Code 71.50% 73.00% KiloCode 67.50% 59.00% OpenCode 64.50% 57.00% RooCode 64.50% 43.00%
In production, coding agents rarely run on a single scaffold. They live inside a heterogeneous stack of harnesses, each with its own prompting conventions, tool schemas, and orchestration patterns — and a model has to perform reliably across all of them to be genuinely useful. Step 3.7 Flash is markedly more balanced across this stack than Step 3.5 Flash, with the per-harness gap narrowing substantially on our in-house Step-SWE-Bench.
To push quality further without giving up Flash-tier efficiency, Step 3.7 Flash supports Advisor Mode. Step 3.7 Flash drives the trajectory end-to-end — calling tools, reading results, and iterating — and consults a larger advisor model only at the few inflection points where its own judgment falls short, such as planning or recovering from repeated failures. This is Step's implementation of the advisor strategy described by Anthropic, where a small executor stays in control and escalates to a frontier advisor only when needed, keeping most of the run at executor cost. With Advisor Mode enabled, Step 3.7 Flash reaches 97% of Claude Opus 4.6's coding performance at roughly one-ninth the per-task cost ( $0.19 v.s. $1.76 per task).