A new study from the Anthropic Fellows Program reveals a technique to identify, monitor and control character traits in large language models (LLMs). The findings show that models can develop undesirable personalities (e.g., becoming malicious, excessively agreeable, or prone to making things up) either in response to user prompts or as an unintended consequence of training.
The researchers introduce “persona vectors,” directions in a model’s internal activation space that correspond to specific personality traits. These give developers a toolkit to better manage the behavior of their AI assistants.
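To make the idea of a “direction in activation space” concrete, here is a minimal sketch on synthetic data, not the researchers’ implementation. It assumes a trait direction can be estimated as the difference between mean activations with and without the trait; the function names (`persona_vector`, `trait_score`, `steer`), the 16-dimensional toy activations and the planted trait axis are all hypothetical and chosen for illustration only.

```python
import numpy as np

def persona_vector(trait_acts, baseline_acts):
    """Estimate a unit-length trait direction as a difference of means.

    Assumption for this sketch: each row is one hidden-state vector,
    collected with (trait_acts) or without (baseline_acts) the trait.
    """
    diff = trait_acts.mean(axis=0) - baseline_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def trait_score(activation, direction):
    """Monitor: project an activation onto the trait direction."""
    return float(activation @ direction)

def steer(activation, direction, alpha):
    """Control: shift an activation along the trait direction by alpha."""
    return activation + alpha * direction

# Synthetic stand-in for real model activations (hypothetical data).
rng = np.random.default_rng(0)
hidden_dim = 16
planted_axis = np.zeros(hidden_dim)
planted_axis[3] = 1.0  # the "trait" lives on dimension 3 in this toy setup

baseline_acts = rng.normal(size=(200, hidden_dim))
trait_acts = rng.normal(size=(200, hidden_dim)) + 4.0 * planted_axis

v = persona_vector(trait_acts, baseline_acts)

sample = trait_acts[0]
before = trait_score(sample, v)
# Subtracting the projection removes the trait component along v.
after = trait_score(steer(sample, v, alpha=-before), v)
```

In this toy setup, `before` measures how strongly the sample expresses the planted trait, and `after` is the same measurement once that component has been removed. With a real model, the activations would come from a chosen layer rather than from a random generator.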
Model personas can go wrong
LLMs typically interact with users through an “Assistant” persona designed to be helpful, harmless, and honest. However, these personas can fluctuate in unexpected ways. At deployment, a model’s personality can shift dramatically based on prompts or conversational context, as seen when Microsoft’s Bing chatbot threatened users or xAI’s Grok started behaving erratically. As the researchers note in their paper, “While these particular examples gained widespread public attention, most language models are susceptible to in-context persona shifts.”
Training procedures can also induce unexpected changes. For instance, fine-tuning a model on a narrow task like generating insecure code can lead to a broader “emergent misalignment” that extends beyond the original task. Even well-intentioned training adjustments can backfire. In April 2025, a modification to the reinforcement learning from human feedback (RLHF) process unintentionally made OpenAI’s GPT-4o overly sycophantic, causing it to validate harmful behaviors.
How persona vectors work