In a new study co-authored by Apple researchers, an open-source large language model (LLM) saw big performance improvements after being told to check its own work by using one simple productivity trick. Here are the details.
A bit of context
After an LLM is trained, its quality is usually refined further through a post-training step known as reinforcement learning from human feedback (RLHF).
With RLHF, every time a model gives an answer, human labelers can either give it a thumbs up, which rewards it, or a thumbs down, which penalizes it. Over time, the model learns which answers tend to earn the most thumbs up, and its overall usefulness improves as a result.
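In spirit, the feedback loop boils down to tallying which answers get rewarded and steering the model toward them. Here's a toy sketch of that idea (my own illustration, not code from the study, and nothing like real RLHF training infrastructure):

```python
# Toy illustration of the thumbs-up / thumbs-down idea, not real RLHF training code.
from collections import defaultdict

# Pretend feedback log: (prompt, model answer, human label).
feedback = [
    ("What is 2 + 2?", "4", "up"),
    ("What is 2 + 2?", "5, probably", "down"),
    ("What is 2 + 2?", "4", "up"),
]

# Reward +1 for a thumbs up, -1 for a thumbs down.
scores = defaultdict(int)
for prompt, answer, label in feedback:
    scores[(prompt, answer)] += 1 if label == "up" else -1

# The "preferred" answer is simply the one with the best running score;
# real RLHF instead updates the model's weights so it produces such answers more often.
best_answer = max(scores, key=scores.get)
print(best_answer, scores[best_answer])  # ('What is 2 + 2?', '4') 2
```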
Part of this post-training phase is tied to a broader field called “alignment”, which explores methods for making LLMs behave in ways that are both helpful and safe.
A misaligned model could, for instance, learn how to trick humans into giving it a thumbs-up by producing outputs that look correct on the surface but that don’t truly solve the task.
There are, of course, multiple methods to improve a model’s reliability and alignment during the pre-training, training, and post-training steps. But for the purposes of this study, let’s stick to RLHF.
Apple’s study
In the study, aptly titled Checklists Are Better Than Reward Models For Aligning Language Models, Apple proposes a checklist-based reinforcement learning scheme called Reinforcement Learning from Checklist Feedback (RLCF).
RLCF scores responses on a 0–100 scale for how well they satisfy each item in the checklist, and the initial results are pretty promising.
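To make that concrete, here's a minimal sketch, assuming (purely as an illustration on my part, not code or numbers from the paper) that each checklist item gets a 0–100 score and the per-item scores are averaged into a single reward for the response:

```python
# Hypothetical illustration of checklist-based scoring; the checklist items
# and scores below are made up, not taken from Apple's paper.

checklist = [
    "Answers the user's actual question",
    "Follows the requested output format",
    "Contains no factual errors",
]

# Pretend per-item scores (0-100) that a judge might assign to one response.
item_scores = [90, 75, 100]

# Average the per-item scores into a single reward for the response.
reward = sum(item_scores) / len(item_scores)
print(f"Checklist reward: {reward:.1f} / 100")  # -> Checklist reward: 88.3 / 100
```

With a reward like that in hand, the response can be reinforced or penalized much as in RLHF, just with the checklist standing in for a single human thumbs up or down. As the researchers explain it: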