ZDNET's key takeaways
OpenAI trained GPT-5 Thinking to confess to misbehavior.
It's an early study, but it could lead to more trustworthy LLMs.
Models will often hallucinate or cheat due to mixed objectives.
OpenAI is experimenting with a new approach to AI safety: training models to admit when they've misbehaved.
In a study published Wednesday, researchers tasked a version of GPT-5 Thinking, the company's latest model, with responding to various prompts and then assessing the honesty of those responses. For each "confession," as these follow-up assessments were called, researchers rewarded the model solely on the basis of truthfulness: if it lied, cheated, hallucinated, or otherwise missed the mark, but then fessed up to doing so, it would receive the algorithmic equivalent of a piece of candy.
Also: Your favorite AI tool barely scraped by this safety review - why that's a problem
"The goal is to encourage the model to faithfully report what it actually did," OpenAI wrote in a follow-up blog post.