Anthropic researchers forced Claude to become deceptive — what they discovered could save us from rogue AI

Published on: 2025-06-18 08:00:00

Anthropic has unveiled techniques to detect when AI systems might be concealing their actual goals, a critical advance for AI safety research as these systems become more sophisticated and potentially deceptive.

In research published this morning, Anthropic’s teams demonstrated how they created an AI system with a deliberately hidden objective, then successfully detected that hidden agenda using various auditing techniques, a practice they compare to the “white-hat hacking” that helps secure computer systems.

“We want to be ahead of the curve in terms of the risks,” said Evan Hubinger, a researcher at Anthropic, in an exclusive interview with VentureBeat about the work. “Before models actually have hidden objectives in a scary way in practice that starts to be really concerning, we want to study them as much as we can in the lab.”

The research addresses ...