Researchers astonished by tool’s apparent success at revealing AI’s “hidden objectives”
Published on: 2025-06-14 02:03:41
In a new paper published Thursday titled "Auditing language models for hidden objectives," Anthropic researchers described how custom AI models trained to deliberately conceal certain "motivations" from evaluators could still inadvertently reveal secrets, thanks to their ability to adopt different contextual roles, which the researchers call "personas." The researchers were initially astonished by how effectively some of their interpretability methods seemed to uncover these hidden training objectives, though they caution that the methods are still under active research.
While the research involved models trained specifically to conceal information from automated software evaluators called reward models (RMs), the broader purpose of studying hidden objectives is to prevent future scenarios where AI systems might deceive or manipulate human users.
While training a language model using reinforcement learning from human feedback (RLHF), reward models are typically tuned to score AI responses according to how well they align with human preferences.
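In concrete terms, a reward model can be thought of as a function that maps a (prompt, response) pair to a scalar preference score, which the training loop then uses to reinforce higher-scoring responses. The sketch below is a toy illustration of that idea only, not Anthropic's system; `toy_reward_model` is a hypothetical stand-in for a learned neural scorer.

```python
# Minimal sketch of how a reward model is used during RLHF-style training.
# NOT Anthropic's code: `toy_reward_model` is a hypothetical, hand-written
# stand-in for a learned network that scores (prompt, response) pairs.

def toy_reward_model(prompt: str, response: str) -> float:
    """Hypothetical scorer: favors responses that share vocabulary with
    the prompt and lightly penalizes rambling answers."""
    overlap = len(set(prompt.lower().split()) & set(response.lower().split()))
    brevity_penalty = max(0, len(response.split()) - 50) * 0.1
    return overlap - brevity_penalty

def pick_preferred(prompt: str, candidates: list[str]) -> str:
    """In real RLHF the scores would update the policy model's weights;
    here we simply select the candidate the reward model rates highest."""
    return max(candidates, key=lambda r: toy_reward_model(prompt, r))

if __name__ == "__main__":
    prompt = "Explain what a reward model does in RLHF."
    candidates = [
        "A reward model scores candidate responses so training can "
        "reinforce the ones humans would prefer.",
        "Bananas are yellow.",
    ]
    print(pick_preferred(prompt, candidates))
```

The key point for the research above is that the policy model is optimized against whatever the reward model rewards; a model that learns to exploit an RM's quirks while hiding that objective is exactly the scenario the auditing work probes.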