Auto-Architecture: Karpathy's Loop, Pointed at a CPU
What happens when you take an autonomous research loop out of its comfort zone and aim it at a domain it has no business being good at? Andrej Karpathy's autoresearch showed that a coding agent, given two days and a single-GPU nanochat run, could find 20 training-time optimizations on its own. The recipe is general — propose, implement, measure, keep the wins — but the demonstration stayed inside the agent's home turf: Python, gradient descent, well-known knobs.
I wanted to know if it generalized. So I pointed it at a CPU.
The setup
auto-arch-tournament is a 5-stage in-order RV32IM core in SystemVerilog — the textbook pipeline you'd write in a graduate architecture class. No caches, no branch predictor, no multi-issue on day one. Those are research-loop hypotheses, not features.
The orchestrator is hardcoded. The LLM never edits it. Each round, three slots run in parallel:
The agent proposes a microarchitectural hypothesis as YAML, schema-checked against schemas/hypothesis.schema.json (sketched below). An implementation agent edits files under rtl/ in an isolated git worktree. The eval gate runs:

- riscv-formal — 53 symbolic BMC checks (decode, traps, ordering, liveness, M-ext)
- Verilator cosim — RVFI byte-identical against a Python ISS, ~22% random bus stalls (sketched below)
- 3-seed nextpnr P&R on a Gowin GW2A-LV18 (Tang Nano 20K) — median Fmax × CoreMark iter/cycle = fitness (sketched below)
- CoreMark CRC validation — checked against the same 4 CRCs VexRiscv reports

Improvement → merged into the trunk, becomes the new baseline. Regression / broken / placement-failed → worktree destroyed.
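For concreteness, here is roughly what the first gate could look like, assuming the orchestrator is Python. The jsonschema and PyYAML calls are real library APIs, but the function name and every field in the example hypothesis are hypothetical stand-ins, not the project's actual schema:

```python
import json

import yaml  # PyYAML
from jsonschema import ValidationError, validate


def hypothesis_ok(yaml_text: str, schema_path: str = "schemas/hypothesis.schema.json") -> bool:
    """Reject malformed hypotheses before an implementation agent spends a worktree on one."""
    with open(schema_path) as f:
        schema = json.load(f)
    try:
        validate(instance=yaml.safe_load(yaml_text), schema=schema)
        return True
    except ValidationError:
        return False


# Field names below are illustrative only; the real schema lives in
# schemas/hypothesis.schema.json and is not shown in the post.
print(hypothesis_ok("""
name: fetch-skid-buffer
claim: decoupling fetch from bus stalls raises Fmax more than it costs in LUTs
files: [rtl/fetch.sv]
"""))
```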
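The cosim gate works because the ~22% random bus stalls perturb timing but not the architectural retire stream, so the RVFI commit records can be compared byte for byte. A minimal sketch of that comparison, assuming both the Verilator harness and the Python ISS emit packed records in a shared layout (the trace format is my assumption):

```python
from itertools import zip_longest
from typing import Iterable, Optional


def first_divergence(dut: Iterable[bytes], iss: Iterable[bytes]) -> Optional[int]:
    """Return the retire index where the DUT and ISS traces diverge, or None if
    the streams are byte-identical and equally long. zip_longest pads the
    shorter stream with None, so one side ending early counts as a mismatch
    instead of a silent pass."""
    for i, (d, s) in enumerate(zip_longest(dut, iss)):
        if d != s:
            return i
    return None
```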
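And the number the tournament actually optimizes. The formula is the one stated above, median Fmax across the three nextpnr seeds times CoreMark iterations per cycle; the helper and the sample values below are purely illustrative:

```python
import statistics


def fitness(fmax_mhz_by_seed: list[float], coremark_iter_per_cycle: float) -> float:
    """Median over 3 P&R seeds damps nextpnr placement noise; multiplying by
    iter/cycle stops the agent from buying Fmax with a gutted IPC."""
    return statistics.median(fmax_mhz_by_seed) * coremark_iter_per_cycle


# Illustrative numbers only.
baseline = fitness([40.2, 41.0, 39.8], 2.4e-6)
candidate = fitness([43.5, 39.1, 42.8], 2.3e-6)
print("merge into trunk" if candidate > baseline else "destroy worktree")
```

In the illustrative numbers, the candidate gives up a little iter/cycle but gains enough median Fmax to win, which is exactly the trade the metric is designed to referee.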