Direct Preference Optimization vs. RLHF
Published on: 2025-06-16 08:50:36
We're excited to announce that the Together Fine-Tuning Platform now supports Direct Preference Optimization (DPO)! This technique allows developers to align language models with human preferences, creating more helpful, accurate, and tailored AI assistants. In this deep-dive blog post, we cover what DPO is, how it works, and when to use it, along with code examples. If you'd like to jump straight into code, have a look at our code notebook.
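As a rough sketch of what launching a DPO job with the Together Python client can look like: the base model name, file ID, and the DPO-specific arguments below (training_method, dpo_beta) are illustrative assumptions, so treat the linked notebook and API reference as the authoritative source for exact names.

```python
from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment

# Hypothetical call: the DPO-specific arguments (training_method, dpo_beta)
# are assumptions for illustration; see the notebook / API docs for exact names.
job = client.fine_tuning.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct-Reference",  # example base model
    training_file="file-abc123",   # preference dataset uploaded beforehand
    training_method="dpo",         # assumed flag selecting DPO over SFT
    dpo_beta=0.1,                  # assumed name for the preference-strength parameter
    n_epochs=1,
)
print(job.id, job.status)
```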
Tuning LLMs on Preference Data
Modern language model development typically follows a three-stage process:
1. Pre-training on internet-scale data to build a foundation model with broad knowledge
2. Supervised fine-tuning (SFT) on specific, high-quality examples to adapt the model to a particular knowledge domain or task
3. Preference-based learning to refine the model based on human preferences
This final stage, preference learning, is where DPO comes in as an alternative to Reinforcement Learning from Human Feedback (RLHF). It ensures that models not only produce accurate responses, but responses that people actually prefer.
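To make the idea concrete, here is a minimal sketch of the DPO objective in PyTorch; it is not the platform's implementation, the function and variable names are ours, and the log-probabilities would in practice come from scoring each chosen and rejected response with the policy model and a frozen reference (SFT) model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss from summed sequence log-probabilities, shape (batch,)."""
    # Implicit reward: how much more the policy favors each response
    # than the frozen reference model does.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Binary classification over preference pairs: push the implicit reward
    # of the chosen response above that of the rejected one.
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
    return loss, chosen_rewards.detach(), rejected_rewards.detach()

# Toy usage with random log-probs for a batch of 4 preference pairs.
if __name__ == "__main__":
    b = 4
    loss, cr, rr = dpo_loss(torch.randn(b), torch.randn(b),
                            torch.randn(b), torch.randn(b))
    print(loss.item())
```

Because the preference signal is expressed directly in this supervised loss, DPO needs no separate reward model or reinforcement-learning loop, which is its main practical appeal over RLHF.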