🤓 Terminal-Bench-RL: Training Long-Horizon Terminal Agents with Reinforcement Learning
TL;DR:
I successfully built stable RL training infrastructure that scales to 32x H100 GPUs across 4 bare metal nodes for training long-horizon terminal-based coding agents.
In doing so, I developed Terminal-Agent-Qwen3-32b, which became the highest scoring Qwen3 agent on terminal-bench WITHOUT training! (The result is currently under submission.)
Unfortunately I am too GPU poor to train a SOTA coding agent 😅 (estimated £30k-£50k in compute required), but if anyone has the GPUs, this project should get you there!
This project builds upon the rLLM framework developed by UC Berkeley Sky Lab, extending it with custom environments and infrastructure specifically designed for terminal-based agent training.
💻💰 Training on $1M worth of compute
This image shows my training code running at full throttle on 32x H100s, distributed across a 4-node bare metal cluster, training Qwen3-32B. Thank you Hyperbolic for such a streamlined experience! This was fun!
Due to the extreme cost of this level of compute, I was not able to run it forever! So I made sure it worked, and I also ran the code on less extravagant hardware setups.