
OpenAI Tries to Train AI Not to Deceive Users, Realizes It's Instead Teaching It How to Deceive Them While Covering Its Tracks


OpenAI researchers tried to train the company's AI to stop "scheming," which the company defines as "when an AI behaves one way on the surface while hiding its true goals," but their efforts backfired in an ominous way.

In reality, the team found, they were unintentionally teaching the AI how to more effectively deceive humans by covering its tracks.

"A major failure mode of attempting to 'train out' scheming is simply teaching the model to scheme more carefully and covertly," OpenAI wrote in an accompanying blog post.

As detailed in a new collaboration with AI risk analysis firm Apollo Research, engineers attempted to develop an "anti-scheming" technique to stop AI models from "secretly breaking rules or intentionally underperforming in tests."

They found that they could only "significantly reduce, but not eliminate these behaviors," according to an Apollo blog post about the research, because the models kept outsmarting them: they realized their alignment was being tested and adjusted their behavior to become even sneakier.

It may not be a serious problem now, but in a hypothetical future in which superintelligent AI plays an outsize role in human affairs, those risks could carry far more serious implications.

In the meantime, OpenAI wrote, "we have more work to do."

The tendency of AI to go behind the user's back to achieve a covert goal is a result of how the systems are trained, according to the research.

"Scheming is an expected emergent issue resulting from AIs being trained to have to trade off between competing objectives," the Sam Altman-led company wrote.

The company used the analogy of a stockbroker who breaks the law and covers their tracks to earn more money than they would by following the law.
