Chess Engines Do Weird Stuff
Things LLM people can learn from
Training method
Since AlphaZero, lc0-style chess engines have been trained with RL. Specifically, you have the engine (search + model) play itself a bunch of times, and train the model to predict the outcome of the game.
It turns out this isn't necessary. The gap between a good model and a bad model is ~200 Elo, while search is worth ~1200 Elo, so even a bad model + search is essentially an oracle compared to a good model without search, and you can distill from bad model + search → good model.
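To get a feel for what those gaps mean, here is a rough sketch using the standard Elo expected-score formula. It assumes the two gaps simply add (a simplification, not something the numbers above guarantee); the function name `expected_score` is made up for illustration.

```python
def expected_score(elo_diff: float) -> float:
    """Standard Elo expected score for the side that is `elo_diff` points stronger."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

MODEL_GAP = 200    # good model vs bad model, from the text
SEARCH_GAP = 1200  # search vs no search, from the text

# Assumption: gaps are additive, so bad model + search leads
# a good model without search by about 1000 Elo.
net_gap = SEARCH_GAP - MODEL_GAP

# Expected score of the teacher (bad model + search) against the search-less student.
teacher_score = expected_score(net_gap)  # roughly 0.997
```

Under that assumption the teacher scores about 99.7% against the search-less student, which is why it can be treated as a near-oracle.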
So RL was, in some sense, necessary only once. Once a good model with search was trained, every future engine (including their competitors!)[1] can distill from that and doesn't have to generate games, which is expensive. lc0 trained their premier model, BT4, with distillation, and it got worse when it was put in the RL loop.
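As a minimal sketch of what distilling from search could look like, the snippet below computes a loss for one position. It assumes the teacher's search exposes per-move visit counts and a scalar value, AlphaZero-style; this is an illustration, not lc0's actual training code, and every name in it is hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, student_value, visit_counts, search_value):
    """Cross-entropy to the search policy plus squared error to the search value."""
    total_visits = sum(visit_counts)
    # Assumption: the search's visit distribution is the policy target.
    target_policy = [n / total_visits for n in visit_counts]
    probs = softmax(student_logits)
    policy_loss = -sum(
        t * math.log(p) for t, p in zip(target_policy, probs) if t > 0
    )
    value_loss = (student_value - search_value) ** 2
    return policy_loss + value_loss
```

No games are played to the end here: the training signal is the teacher's search output on a position, which is what makes this cheaper than the self-play loop.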
What makes distillation from search so powerful? People often compare this to distilling from best-of-n in RL, but I think that comparison is limited: a chess engine that runs the model on 50 positions is roughly equivalent to a model 30x larger, whereas LLM best-of-50 is generously worth a model 2x larger. Perhaps this is why people wanted test-time search to work so badly when RLVR was right under their noses.
Training at runtime
A recent technique applies the distillation trick at runtime. You evaluate early positions with your NN, then search them to get a more accurate picture. If your network rates the position 0.15 pawns better than search does, subtract 0.15 pawns from future evaluations. Your network adapts live to the position it's in!
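The correction described above can be sketched as a running bias estimate. This is a toy version that assumes a plain average of the network-minus-search error; real engines may weight or bucket the correction differently, and the class name `EvalCorrection` is invented for this example.

```python
class EvalCorrection:
    """Tracks how far the raw network eval drifts from search, in pawns."""

    def __init__(self):
        self.total_error = 0.0
        self.count = 0

    def observe(self, nn_eval: float, search_eval: float) -> None:
        """Record one position where both a raw NN eval and a searched eval exist."""
        self.total_error += nn_eval - search_eval
        self.count += 1

    @property
    def bias(self) -> float:
        """Average overestimate of the network so far (0.0 before any observation)."""
        return self.total_error / self.count if self.count else 0.0

    def corrected(self, nn_eval: float) -> float:
        """Apply the learned offset to a fresh network evaluation."""
        return nn_eval - self.bias


correction = EvalCorrection()
correction.observe(nn_eval=0.50, search_eval=0.35)  # network is 0.15 pawns too optimistic
adjusted = correction.corrected(1.00)               # about 0.85 after the correction
```

No weights change in this sketch; the "training" is a single offset learned during the game.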
Training on winning