ZDNET's key takeaways
New research from Anthropic identifies patterns inside a model, called persona vectors, that correspond to character traits.
Persona vectors can help developers catch and curb undesirable behavior without hurting a model's performance.
Still, developers don't fully understand why models hallucinate or turn malicious.
Why do models hallucinate, make violent suggestions, or overly agree with users? Generally, researchers don't really know. But Anthropic just published new findings that could help stop this behavior before it happens.
In a paper released Friday, the company explores how and why models exhibit undesirable behavior, and what can be done about it. A model's persona can change during training and, once it's deployed, be influenced by users. This is evidenced by models that may have passed safety checks before deployment, but then develop alter egos or act erratically once they're publicly available -- like when OpenAI rolled back a GPT-4o update for being too agreeable. See also when Microsoft's Bing chatbot revealed its internal codename, Sydney, in 2023, or Grok's recent antisemitic tirade.
Why it matters
AI usage is on the rise; models are increasingly embedded in everything from education tools to autonomous systems, making how they behave even more important -- especially as safety teams shrink and AI regulation fails to materialize. That said, President Donald Trump's recent AI Action Plan did mention the importance of interpretability -- the ability to understand how models make decisions -- a goal that persona vectors support.
How persona vectors work