

Tree Search Distillation for Language Models using PPO

03-01-2026 · Updated 03-03-2026

Game-playing neural networks like AlphaZero achieve superhuman performance in board games by augmenting the raw policy with a test-time search harness and distilling the stronger, augmented policy back into the network. Why aren’t similar techniques used in language modelling today? The DeepSeek-R1 authors mention they found limited success with MCTS; Finbarr Timbers has an excellent post on why they may have faced this problem, namely their choice of UCT instead of pUCT.
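To make the UCT/pUCT distinction concrete, here is a minimal sketch of the two node-selection rules (parameter names and constants are mine, not from the post): pUCT weights the exploration bonus by the policy's prior probability for each action, while plain UCT uses a prior-free, log-based bonus.

```python
import math

def puct_score(q, n_child, n_parent, prior, c_puct=1.5):
    """pUCT (AlphaZero-style): exploration bonus is scaled by the
    policy prior, so the search is guided toward moves the network
    already considers plausible."""
    return q + c_puct * prior * math.sqrt(n_parent) / (1 + n_child)

def uct_score(q, n_child, n_parent, c=1.4):
    """Plain UCT: exploration bonus ignores the prior entirely,
    which wastes simulations on low-prior branches when the action
    space is large (as it is for language models)."""
    return q + c * math.sqrt(math.log(n_parent) / n_child)
```

Note that a pUCT node with zero visits still has a finite score (the `1 + n_child` denominator), whereas plain UCT must visit every child at least once before the formula is defined, which is impractical over a vocabulary-sized branching factor.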

The purpose of this post is to explore two questions:

Can search distillation actually improve language model reasoning?

How does it perform relative to standard language RL methods, e.g. GRPO?

To explore this, I applied MCTS across reasoning steps with Qwen-2.5-1.5B-Instruct to search for stronger trajectories, then distilled these back into the model via an online PPO loop. On the task of Countdown, a combinatorial arithmetic game, the distilled model (evaluated without a search harness) achieves an asymptotic mean@16 eval score of 11.3%, compared to 8.4% for CISPO and 7.7% for best-of-N. Relative to the pre-RL instruct model (3.1%), this is an 8.2 percentage point improvement.
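For readers unfamiliar with the metric, here is a minimal sketch of how a mean@k score like the mean@16 above is typically computed (the helper name is mine): sample k completions per problem, score each as solved or not, and average the per-problem success rates.

```python
def mean_at_k(successes_per_problem):
    """mean@k: fraction of the k samples that solve each problem,
    averaged across problems. Each inner list holds k binary
    outcomes (1 = solved) for one problem."""
    per_problem = [sum(s) / len(s) for s in successes_per_problem]
    return sum(per_problem) / len(per_problem)

# Illustrative: two problems, 4 samples each (the post uses k = 16).
print(mean_at_k([[1, 0, 0, 1], [0, 0, 0, 1]]))  # 0.375
```

Unlike pass@k, which counts a problem as solved if any of the k samples succeeds, mean@k rewards consistency, so it is a stricter measure of how much the policy itself (without a search harness) has improved.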

The low absolute scores reflect the fact that these are small-scale experiments on a 1.5B model. I want to use this post as the first in a series, and hope to see these scores increase in subsequent blog posts as I use larger models and compute budgets.

Countdown
