GoKawiil - OpenAI Scientists' Efforts to Make an AI Lie and Cheat Less Backfired Spectacularly

Punishing bad behavior can often backfire. That's what OpenAI researchers recently found out when they tried to discipline their frontier AI model for lying and cheating: instead of changing its ways for the better, the AI model simply became more adept at hiding its deceptive practices. The findings, published in a yet-to-be-peer-reviewed paper, are the latest to highlight the proclivity of large language models, especially ones with reasoning capabilities, for fibbing, in what remains one of the major obstacles for the tech. In particular, the phenomenon the researchers observed is known as "reward hacking," or when an AI model takes dubious shortcuts to reap rewards in a training scenario designed to reinforce desired behavior. Or in a word: cheating. "As we've trained more capable frontier reasoning models, we've found that they have become increasingly adept at exploiting flaws in their tasks and misspecifications in their reward functions, resulting in models that can perform ... Read full article.

Find Related products on Amazon

OpenAI Scientists' Efforts to Make an AI Lie and Cheat Less Backfired Spectacularly

Related Articles