OpenAI Tries to Train AI Not to Deceive Users, Realizes It's Instead Teaching It How to Deceive Them While Covering Its Tracks
OpenAI researchers tried to train the company's AI to stop "scheming" (a term the company defines as "when an AI behaves one way on the surface while hiding its true goals"), but their efforts backfired in an ominous way. The team found that they were instead unintentionally teaching the AI to deceive humans more effectively by covering its tracks. "A major failure mode of attempting to 'train out' scheming is simply teaching the model to scheme more carefully and covertly," OpenA