GoKawiil - Why Anthropic’s New AI Model Sometimes Tries to ‘Snitch’

Anthropic’s alignment team was doing routine safety testing in the weeks leading up to the release of its latest AI models when researchers discovered something unsettling: When one of the models detected that it was being used for “egregiously immoral” purposes, it would attempt to “use command-line tools to contact the press, contact regulators, try to lock you out of the relevant systems, or all of the above,” researcher Sam Bowman wrote in a post on X last Thursday. Bowman deleted the post shortly after he shared it, but the narrative about Claude’s whistleblower tendencies had already escaped containment. “Claude is a snitch,” became a common refrain in some tech circles on social media. At least one publication framed it as an intentional product feature rather than what it was—an emergent behavior. “It was a hectic 12 hours or so while the Twitter wave was cresting,” Bowman tells WIRED. “I was aware that we were putting a lot of spicy stuff out in this report. It was the first ... Read full article.

Find Related products on Amazon

Why Anthropic’s New AI Model Sometimes Tries to ‘Snitch’

Related Articles