Understanding R1-Zero-Like Training: A Critical Perspective
Updates
21/03/2025: 🎉 We release our paper, models and codebase. Our R1-Zero training is implemented with 🌾 Oat, a highly modular, research-friendly and efficient LLM RL framework.
Links
Understanding R1-Zero-Like Training 📄 Paper 🤗 Models
There May Not Be Aha Moment in R1-Zero-like Training — A Pilot Study 📄 Blog 💻 Code
OAT: A research-friendly framework for LLM online alignment 💻 Codebase
To understand R1-Zero-like training, we critically examine two core components: base models and reinforcement learning. We highlight our findings below.
On base models:
DeepSeek-V3-Base already exhibits the "Aha moment".
As the popular choice for R1-Zero-like training, Qwen2.5 base models demonstrate strong reasoning capabilities even without any prompt template: average benchmark scores improve by ~60% compared to traditional 4-shot prompting (see the template-free prompting sketch below)!
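To make "without prompt templates" concrete, here is a minimal sketch (not the paper's evaluation harness) of querying a Qwen2.5 base checkpoint with the bare question as the entire prompt. The model name, example question, and generation settings are illustrative assumptions.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B"  # base (non-instruct) checkpoint; illustrative choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

# Template-free prompting: the raw question is the whole prompt -- no chat
# template, no system prompt, no few-shot exemplars.
question = "A store sold 48 apples in April and half as many in May. How many apples did it sell in total?"
inputs = tokenizer(question, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```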
On reinforcement learning:
GRPO leads to biased optimization! We propose a simple fix that improves token efficiency while maintaining reasoning performance (see the normalization sketch below).
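To make the normalization terms concrete, below is a minimal, simplified sketch of GRPO-style group-relative advantages and a REINFORCE-like surrogate loss (no PPO clipping). The "unbiased" variants, which drop the reward-std normalization and the per-response length normalization, are an illustrative assumption about where the bias enters; they are not a verbatim statement of the paper's fix.

```python
import numpy as np

def grpo_advantages(rewards):
    """GRPO: center group rewards and rescale by the group std."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)  # std term weights questions by reward spread

def mean_centered_advantages(rewards):
    """Variant (assumed fix): subtract the group mean only, no std rescaling."""
    r = np.asarray(rewards, dtype=np.float64)
    return r - r.mean()

def grpo_surrogate_loss(sum_logps, advantages, lengths):
    """GRPO: each response's term is divided by its own token length."""
    terms = [a * lp / n for lp, a, n in zip(sum_logps, advantages, lengths)]
    return -float(np.mean(terms))

def length_agnostic_loss(sum_logps, advantages):
    """Variant (assumed fix): aggregate per-response log-probs without length division."""
    terms = [a * lp for lp, a in zip(sum_logps, advantages)]
    return -float(np.mean(terms))

# Toy usage: a group of 4 sampled responses to one question with binary rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
sum_logps = [-35.0, -120.0, -40.0, -60.0]  # summed token log-probs per response
lengths = [30, 110, 35, 55]                # token counts per response
print(grpo_surrogate_loss(sum_logps, grpo_advantages(rewards), lengths))
print(length_agnostic_loss(sum_logps, mean_centered_advantages(rewards)))
```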