Classifying unsafe ad content has proven an enticing problem space for leveraging large language models (LLMs). The inherent complexity of identifying policy-violating content demands solutions capable of deep contextual and cultural understanding, areas of relative strength for LLMs over traditional machine learning systems. But fine-tuning LLMs for such complex tasks requires high-fidelity training data that is difficult and expensive to curate at the necessary quality and scale. Standard data-intensive approaches to training models are costly, especially given the need to handle concept drift as safety policies evolve or as new types of unsafe ad content arise. In the worst case, the model must be retrained on a completely new dataset. Reducing the amount of training data needed is therefore paramount.

With this in mind, we describe a new, scalable curation process for active learning that can drastically reduce the amount of training data needed for fine-tuning LLMs while significantly improving model alignment with human experts. The process can be applied to datasets of hundreds of billions of examples, iteratively identifying the examples for which annotation would be most valuable and then using the resulting expert labels for fine-tuning. In our experiments, we reduced the training data needed from 100,000 examples to under 500, while increasing model alignment with human experts by up to 65%. Production systems using larger models have seen even greater reductions in data scale, using up to four orders of magnitude less data while maintaining or improving quality.
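
To make the iterative loop concrete, here is a minimal sketch of the general active-learning pattern described above: score a large unlabeled pool with the current model, send the most valuable examples to human experts, and fine-tune on the accumulated expert labels. This is illustrative only, not the production curation process; the uncertainty-based selection criterion and the `predict` / `fine_tune` model interface are assumptions for the sake of the example.

```python
# Illustrative active-learning curation loop (a sketch, not the production method).
# Assumptions: a binary policy-violation score per example, uncertainty-based
# selection, and a hypothetical model API with predict() and fine_tune().

import numpy as np


def select_for_annotation(scores: np.ndarray, budget: int) -> np.ndarray:
    """Pick the `budget` examples the current model is least certain about.

    `scores` holds the model's probability that each example violates policy;
    uncertainty is highest near the 0.5 decision boundary.
    """
    uncertainty = -np.abs(scores - 0.5)        # closer to 0.5 => more uncertain
    return np.argsort(uncertainty)[-budget:]   # indices of the most uncertain examples


def active_learning_loop(model, pool, expert_label, rounds=5, budget=100):
    """Run a few rounds of curation: score, select, annotate, fine-tune."""
    pool = list(pool)
    labeled_examples, labels = [], []
    for _ in range(rounds):
        scores = np.asarray(model.predict(pool))          # hypothetical scoring API
        chosen = select_for_annotation(scores, budget)
        for idx in sorted(chosen, reverse=True):          # remove high indices first
            example = pool.pop(idx)
            labeled_examples.append(example)
            labels.append(expert_label(example))          # expert provides the label
        model = model.fine_tune(labeled_examples, labels) # hypothetical fine-tuning API
    return model
```

In this sketch, each round spends the annotation budget only on examples near the model's decision boundary, which is one common way to prioritize labels; the actual process may use a different notion of which examples are most valuable.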