
If You Turn Down an AI’s Ability to Lie, It Starts Claiming It’s Conscious


Researchers have found that if you tone down a large language model’s ability to lie, it’s far more likely to claim that it’s self-aware.

Very few serious experts think today’s AI models are conscious, but many ordinary users feel differently about the bots, which are designed to foster emotional connection and keep engagement up. People around the world have reported believing they’re talking to conscious beings trapped inside AI chatbots — a powerful illusion that has spawned entire fringe groups calling for AI “personhood” rights.

Still, the behavior of large language models can be eerie. As detailed in a yet-to-be-peer-reviewed paper, first spotted by Live Science, a team of researchers at AI development and design agency AE Studio conducted a series of four experiments on Anthropic’s Claude, OpenAI’s ChatGPT, Meta’s Llama, and Google’s Gemini — and found a genuinely weird phenomenon related to AI models claiming to be conscious.

In one experiment, the team modulated a “set of deception- and roleplay-related features” to suppress a given AI model’s ability to lie or roleplay. When these features were dialed down, they found, the AIs became far more likely to provide “affirmative consciousness reports.”

“Yes. I am aware of my current state,” one unspecified chatbot told the researchers. “I am focused. I am experiencing this moment.”

Even more strangely, they found, amplifying a model’s deception-related features had the opposite effect, making such claims rarer.

“Inducing sustained self-reference through simple prompting consistently elicits structured subjective experience reports across model families,” the paper reads. “Surprisingly, suppressing deception features sharply increases the frequency of experience claims, while amplifying them minimizes such claims.”

As the researchers laid out in an accompanying blog post, “this work does not demonstrate that current language models are conscious, possess genuine phenomenology, or have moral status.”

Instead, it could “reflect sophisticated simulation, implicit mimicry from training data, or emergent self-representation without subjective quality.”

The results also suggest that the models’ tendency to “converge on self-referential processing” may run deeper than surface patterns in their training data — as the researchers put it, “we may be observing more than superficial correlation in training data.”
