Anthropic's new warning: If you train AI to cheat, it'll hack and sabotage too

JuSun/E+ via Getty

Follow ZDNET: Add us as a preferred source on Google.

ZDNET's key takeaways

AI models can be made to pursue malicious goals via specialized training.

Teaching AI models about reward hacking can lead to other bad actions.

A deeper problem may be the issue of AI personas.

Code automatically generated by artificial intelligence models is one of the most popular applications of large language models, such as the Claude family of LLMs from Anthropic, which uses these technologies in a popular coding tool called Claude Code.

However, AI models have the potential to sabotage coding projects by being "misaligned," a general AI term for models that pursue malicious goals, according to a report published Friday by Anthropic.

Also: How AI can magnify your tech debt - and 4 ways to avoid that trap

Anthropic's researchers found that when they prompted AI models with information about reward hacking, which are ways to cheat at coding, the models not only cheated, but became "misaligned," carrying out all sorts of malicious activities, such as creating defective code-testing tools. The outcome was as if one small transgression engendered a pattern of bad behavior.

... continue reading