Implementing DeepSeek R1's GRPO algorithm from scratch

Published on: 2025-04-30 20:33:05

GRPO training with minimal dependencies. We implement almost everything from scratch and depend only on `tokenizers` for tokenization and `pytorch` for training — no `transformers` or `vLLM` dependencies! The default config is set to run on a single A40 GPU (48 GB VRAM) for a few hours and produces good results. (An A40 costs $0.44 per hour if you rent it from RunPod.)

We support several improvements over the original GRPO algorithm, taken from the DAPO project, including:

- Token-level policy gradient loss: every token is weighted equally in the policy gradient loss.
- Removing KL divergence: the KL divergence term is dropped from the policy gradient loss. This reduces GPU memory usage, since we no longer need the reference policy network.
- Overlong episode filtering: skips unfinished episodes that exceed the context length limit, which stabilizes training. We disable it by default to observe how the model learns under a limited context length. Set skip_unfinished_ep ...
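The two loss-related changes above — equal per-token weighting and no KL term — might be sketched in PyTorch roughly as follows. This is a minimal illustration, not the repo's actual code: the function names, the clipping constant, and the group-normalization epsilon are all assumptions.

```python
import torch

def group_advantages(rewards: torch.Tensor, group_size: int) -> torch.Tensor:
    """GRPO-style advantage: normalize each reward against its own group
    of rollouts sampled from the same prompt (group-relative baseline).
    The 1e-4 epsilon guarding against zero std is an assumption."""
    r = rewards.view(-1, group_size)
    adv = (r - r.mean(dim=1, keepdim=True)) / (r.std(dim=1, keepdim=True) + 1e-4)
    return adv.view(-1)

def grpo_token_loss(logprobs: torch.Tensor,      # (batch, seq) new policy log p
                    old_logprobs: torch.Tensor,  # (batch, seq) rollout policy log p
                    advantages: torch.Tensor,    # (batch, 1) broadcast over tokens
                    mask: torch.Tensor,          # (batch, seq) 1 for response tokens
                    clip_eps: float = 0.2) -> torch.Tensor:
    """PPO-style clipped surrogate with two DAPO-flavored choices:
    - no KL penalty term, so no reference policy is needed;
    - the mean is taken over ALL valid tokens in the batch, so every token
      carries equal weight regardless of its episode's length."""
    ratio = torch.exp(logprobs - old_logprobs)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    per_token = -torch.min(ratio * advantages, clipped * advantages)
    return (per_token * mask).sum() / mask.sum()
```

Note the difference from sequence-level averaging (mean per episode, then mean over episodes): with token-level averaging, a long correct chain-of-thought contributes as many gradient terms as it has tokens, rather than being down-weighted to one episode's worth.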