‘Subliminal learning’: Anthropic uncovers how AI fine-tuning secretly teaches bad habits
Want smarter insights in your inbox? Sign up for our weekly newsletters to get only what matters to enterprise AI, data, and security leaders. Subscribe Now A new study by Anthropic shows that language models might learn hidden characteristics during distillation, a popular method for fine-tuning models for special tasks. While these hidden traits, which the authors call “subliminal learning,” can be benign, the research finds they can also lead to unwanted results, such as misalignment and har