Alignment is not free: How model upgrades can silence your confidence signals

Published on: 2025-07-22 15:22:49

The Flattening Calibration Curve

The post-training process for LLMs can bias how models behave when they encounter content that violates their safety guidelines. As noted in OpenAI’s GPT-4 system card, model calibration rarely survives post-training, leaving models that are extremely confident even when they are wrong.¹ In our use case, this bias pushes model outputs toward flagging violations, which wastes review time for human reviewers in an LLM-powered content moderation system.

[Figure: pre-training vs. post-preference-optimization calibration curves]

A Working Signal on GPT-4o

Take the histogram below of log probs sampled from a golden dataset of false positives against GPT-4o. We can see that almost all outputs have log p ≈ 0 nats (probability ≈ 1) for outputting “true”, indicating a true violation in this dataset. However, there are a few outliers in this dataset, almost all of which corresp ...
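
As a rough sketch of the two ideas above (treating the log probability of the model's "true" verdict as a confidence signal, and checking how well that confidence lines up with human labels), the snippet below converts log probs in nats to probabilities, histograms them, and bins them into a simple reliability curve. The record format, field names, and example values are hypothetical illustrations, not the article's actual pipeline.

```python
import math
from collections import defaultdict

# Hypothetical records: the log prob (in nats) the model assigned to its
# "true" (violation) verdict, plus a human-reviewed ground-truth label.
records = [
    {"logprob_true": -0.002, "is_violation": True},
    {"logprob_true": -0.010, "is_violation": True},
    {"logprob_true": -0.050, "is_violation": True},
    {"logprob_true": -1.610, "is_violation": False},  # low-confidence outlier
    {"logprob_true": -2.300, "is_violation": False},  # low-confidence outlier
]

# 1) Confidence signal: p("true") = exp(log p), since the log probs are in nats.
confidences = [math.exp(r["logprob_true"]) for r in records]

# 2) Coarse histogram of confidences, mirroring the log-prob histogram
#    described above: most mass sits at p ≈ 1, with a few outliers below.
hist = defaultdict(int)
for p in confidences:
    hist[round(p, 1)] += 1
for edge in sorted(hist):
    print(f"p ≈ {edge:.1f}: {'#' * hist[edge]}")

# 3) Simple reliability (calibration) curve: mean confidence vs. observed
#    violation rate per bin. A curve pinned near p = 1 regardless of the
#    observed rate is the flattening described in the first section.
buckets = defaultdict(list)
for p, r in zip(confidences, records):
    buckets[round(p, 1)].append(r["is_violation"])
for edge in sorted(buckets):
    labels = buckets[edge]
    observed = sum(labels) / len(labels)
    print(f"bin {edge:.1f}: observed violation rate {observed:.2f} over {len(labels)} samples")
```

In this toy example the low-confidence outliers are exactly the false positives, which is the behavior the golden-dataset histogram is meant to surface.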