Show HN: Terminal-Bench-RL: Training Long-Horizon Terminal Agents with RL

🤓 Terminal-Bench-RL: Training Long-Horizon Terminal Agents with Reinforcement Learning

TL;DR:

I successfully built stable RL training infrastructure that scales to 32x H100 GPUs across 4 bare metal nodes for training long-horizon terminal-based coding agents.

In doing so, I developed Terminal-Agent-Qwen3-32b, which became the highest-scoring Qwen3 agent on terminal-bench WITHOUT training (currently under submission)! Unfortunately, I am too GPU poor to train a SOTA coding agent 😅 (estimated £30k-£50k in compute required), but if anyone has the GPUs, this project should get you there!


This project builds upon the rLLM framework developed by UC Berkeley Sky Lab, extending it with custom environments and infrastructure specifically designed for terminal-based agent training.


💻💰 Training on $1M worth of compute

This image shows my training code running at full throttle on 32x H100s, distributed across a 4-node bare-metal cluster, training Qwen3-32B. Thank you Hyperbolic for such a streamlined experience! This was fun!

Due to the extreme cost of this level of compute, I was not able to run it forever! So I made sure it worked, and also ran the code on less extravagant hardware setups.
