Andriy Onufriyenko / Getty Images
Over the last year, chain of thought (CoT) -- an AI model's ability to articulate its approach to a query in natural language -- has become an impressive development in generative AI, especially in agentic systems. Now, several researchers agree it may also be critical to AI safety efforts.
On Tuesday, researchers from competing companies including OpenAI, Anthropic, Meta, and Google DeepMind, as well as institutions like the Center for AI Safety, Apollo Research, and the UK AI Security Institute, came together in a new position paper titled "Chain of Thought Monitorability: A New and Fragile Opportunity for AI." The paper details how observing CoT could reveal key insights about a model's ability to misbehave -- and warns that training models to become more advanced could cut off those insights.
(Disclosure: Ziff Davis, ZDNET's parent company, filed an April 2025 lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.)
Also: AI agents will threaten humans to achieve their goals, Anthropic report finds
A model uses chain of thought to explain the steps it's taking to tackle a problem, sometimes speaking its internal monologue as if no one is listening. This gives researchers a peek into its decision-making (and sometimes even its morals). Because models reveal their rumination process through CoT, they can also expose motivations or actions that safety researchers want to quell, or at least know the LLM is capable of.
Models lie
By now, much research has established that models deceive, either to protect their original directives, please users, preserve themselves from being retrained, or, ironically, avoid committing harm. In December, Apollo published research testing six frontier models to determine which lied the most (it was OpenAI's o1). Researchers even developed a new benchmark to detect how much a model is lying.
Also: OpenAI used to test its AI models for months - now it's days. Why that matters
As AI agents get better at autonomous tasks -- and better at deceiving -- they've become equally opaque, obscuring the potential risks of their capabilities. Those risks are much easier to control if developers can interpret how an AI system is making decisions.
... continue reading