Training gaming agents is an addictive game. A game of sleepless nights, grinds, explorations, sweeps, and prayers. PufferLib allows anyone with a gaming computer to play the RL game, but getting from “pretty good” to “superhuman” requires tweaking every lever, repeatedly.
This is the story of how I trained agents that beat massive (few-TB) search-based solutions on 2048 using a 15MB policy trained for 75 minutes, and of how I discovered that bugs can be features in Tetris. TLDR? PufferLib, Pareto sweeps, and curriculum learning.
Speed and Iteration
PufferLib’s C-based environments run at 1M+ steps per second per CPU core. That is fast enough to solve Breakout in under one minute. It also ships advanced RL machinery: optimized vectorized environments, LSTM policies, the Muon optimizer, and Protein, a cost-aware hyperparameter sweep framework. Since a 1B-step training run takes minutes, RL transforms from “YOLO and pray” into systematic search, enabling hundreds of hyperparameter sweeps in hours rather than days.
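To make “fast enough to iterate” concrete, here is a minimal throughput check. It uses the generic Gymnasium vector API rather than PufferLib’s own vectorization, and CartPole plus the env count are placeholders chosen purely for illustration; the point is knowing your steps-per-second before blaming the learning algorithm.

```python
import time

import gymnasium as gym

# Placeholder setup: a synchronous vector of CartPole envs. In practice you
# would swap in PufferLib's optimized vectorization and one of its C envs.
NUM_ENVS = 8
envs = gym.vector.SyncVectorEnv(
    [lambda: gym.make("CartPole-v1") for _ in range(NUM_ENVS)]
)

obs, info = envs.reset(seed=0)
steps = 0
start = time.time()

# Random actions measure raw environment throughput; policy inference
# adds its own cost on top of this number.
while time.time() - start < 5.0:
    actions = envs.action_space.sample()
    obs, rewards, terminations, truncations, infos = envs.step(actions)
    steps += NUM_ENVS  # each vector step advances every env by one frame

print(f"{steps / (time.time() - start):,.0f} env steps per second")
```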
All training ran on two high-end gaming desktops, each with a single RTX 4090. Compute was sponsored by Puffer.ai, thanks!
The Recipe
Augment observations: Give the policy the information it needs.
Tweak rewards: Shape the learning signal and adjust weights.
Design curriculum: Control what the agent experiences and when.
A rough sketch of the reward and curriculum levers follows.
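The sketch below uses generic Gymnasium wrappers rather than PufferLib’s C environments, which expose their own hooks. The shaping term, the 0.1 weight, and the step-based ramp are hypothetical placeholders for whatever your game actually needs.

```python
import gymnasium as gym


class ShapedReward(gym.RewardWrapper):
    """Add a weighted shaping term to the raw environment reward.

    The shaping signal here is a placeholder; in practice it comes from the
    game state (e.g. board monotonicity in 2048), and the weight is a
    hyperparameter you sweep over.
    """

    def __init__(self, env, shaping_weight=0.1):
        super().__init__(env)
        self.shaping_weight = shaping_weight

    def reward(self, reward):
        shaping = 0.0  # placeholder: compute from self.env.unwrapped state
        return reward + self.shaping_weight * shaping


def curriculum_difficulty(global_step, total_steps):
    """Linearly ramp a difficulty knob from 0 (easy) to 1 (hard) over the
    first half of training. A real curriculum might gate on performance
    instead, but this is the simplest way to control what the agent
    experiences and when."""
    return min(1.0, global_step / (0.5 * total_steps))
```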
Network scaling comes last. Only after exhausting observations, rewards, and curriculum should you scale up. Larger networks make training slower; nail the obs and reward first. Once you do scale, the increased capacity may (or may not) reach new heights and reveal new insights, kicking off a fresh iteration cycle.
Sweep methodology: I ran 200 sweeps, starting broad and narrowing to fine-tune. Protein samples from the cost-outcome Pareto front, using small experiments to find optimal hyperparameters before committing to longer runs.
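Protein’s internals aren’t reproduced here, but the core idea of working from a cost-outcome Pareto front is easy to sketch. The helper below, with made-up run records, extracts the frontier: the runs that no other run beats on both lower cost and higher score. A cost-aware sweep biases its next proposals toward that frontier instead of chasing raw score alone.

```python
def pareto_front(runs):
    """Return the runs not dominated on (lower cost, higher score).

    `runs` is a list of dicts like {"cost": gpu_hours, "score": eval_return,
    "config": {...}} -- a stand-in for whatever your sweep backend records.
    """
    front = []
    for candidate in runs:
        dominated = any(
            other["cost"] <= candidate["cost"]
            and other["score"] >= candidate["score"]
            and (other["cost"] < candidate["cost"] or other["score"] > candidate["score"])
            for other in runs
        )
        if not dominated:
            front.append(candidate)
    # Sort cheapest-first to see where extra compute stops paying off.
    return sorted(front, key=lambda r: r["cost"])


# Hypothetical sweep results: cheap-but-weak, mid-size, and one long run.
runs = [
    {"cost": 0.1, "score": 120, "config": {"lr": 3e-4}},
    {"cost": 0.1, "score": 180, "config": {"lr": 1e-3}},
    {"cost": 0.5, "score": 240, "config": {"lr": 1e-3}},
    {"cost": 2.0, "score": 250, "config": {"lr": 7e-4}},
]
print(pareto_front(runs))  # keeps the 180, 240, and 250 runs
```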
2048: Beating the Few-TB Endgame Table