Tech News
← Back to articles

Why complex reasoning models could make misbehaving AI easier to catch

read original related products more articles

Ihor Tsyvinskyi/iStock / Getty Images Plus

Follow ZDNET: Add us as a preferred source on Google.

ZDNET's key takeaways

OpenAI published a new paper called "Monitoring Monitorability."

It offers methods for detecting red flags in a model's reasoning.

Those shouldn't be mistaken for silver bullet solutions, though.

In order to build AI that's truly aligned with human interests, researchers should be able to flag misbehavior while models are still "thinking" through their responses, rather than just waiting for the final outputs -- by which point it could be too late to reverse the damage. That's at least the premise behind a new paper from OpenAI, which introduces an early framework for monitoring how models arrive at a given output through so-called "chain-of-thought" (CoT) reasoning.

Published Thursday, the paper focused on "monitorability," defined as the ability for a human observer or an AI system to make accurate predictions about a model's behavior based on its CoT reasoning. In a perfect world, according to this view, a model trying to lie to or deceive human users would be unable to do so, since we'd possess the analytical tools to catch it in the act and intervene.

Also: OpenAI is training models to 'confess' when they lie - what it means for future AI

One of the key findings was a correlation between the length of CoT outputs and monitorability. In other words, the longer and more detailed a model's step-by-step explanation about its reasoning process, the easier it was to accurately predict its output (though there were exceptions to this rule).

... continue reading