
Emergent Introspective Awareness in Large Language Models


We investigate whether large language models can introspect on their internal states. It is difficult to answer this question through conversation alone, as genuine introspection cannot be distinguished from confabulations. Here, we address this challenge by injecting representations of known concepts into a model’s activations, and measuring the influence of these manipulations on the model’s self-reported states. We find that models can, in certain scenarios, notice the presence of injected concepts and accurately identify them. Models demonstrate some ability to recall prior internal representations and distinguish them from raw text inputs. Strikingly, we find that some models can use their ability to recall prior intentions in order to distinguish their own outputs from artificial prefills. In all these experiments, Claude Opus 4 and 4.1, the most capable models we tested, generally demonstrate the greatest introspective awareness; however, trends across models are complex and sensitive to post-training strategies. Finally, we explore whether models can explicitly control their internal representations, finding that models can modulate their activations when instructed or incentivized to “think about” a concept. Overall, our results indicate that current language models possess some functional introspective awareness of their own internal states. We stress that in today’s models, this capacity is highly unreliable and context-dependent; however, it may continue to develop with further improvements to model capabilities.

Humans, and likely some animals, possess the remarkable capacity for introspection: the ability to observe and reason about their own thoughts. As AI systems perform increasingly impressive feats of cognition, it is natural to wonder whether they possess any similar awareness of their internal states. Modern language models can appear to demonstrate introspection, sometimes making assertions about their own thought processes, intentions, and knowledge. However, this apparent introspection can be, and often is, an illusion. Language models may simply make up claims about their mental states, without these claims being grounded in genuine internal examination. After all, models are trained on data that include demonstrations of introspection, providing them with a playbook for acting like introspective agents, regardless of whether they are. Nevertheless, these confabulations do not preclude the possibility that AI models can, at times, genuinely introspect, even if they do not always do so.

How can we test for genuine introspection in language models? Several previous studies have explored this question and closely related topics, observing model capabilities that are suggestive of introspection. For instance, prior work has shown that models have some ability to estimate their own knowledge, predict their own behavior, identify their learned propensities, and recognize their own outputs (see Related Work for a full discussion). Some recent work has begun to explore mechanisms involved in metacognition, for instance identifying circuits involved in models’ ability to distinguish between known and unknown entities, and identifying representations underlying models’ self-reported propensities. However, for the most part, prior work has not investigated models’ internal activations on introspective tasks, leaving open the question of how models’ claims about themselves relate to their actual internal states.

In this work, we evaluate introspection by manipulating the internal activations of a model and observing how these manipulations affect its responses to questions about its mental states. We refer to this technique as concept injection, an application of activation steering in which we inject activation patterns associated with specific concepts directly into a model's activations. While performing concept injection, we present models with tasks that require them to report on their internal states in various ways. By assessing how these self-reports are affected by injected representations, we can infer the extent to which models’ apparent introspection actually reflects the ground truth of their internal states.
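To make the setup concrete, the sketch below implements one simple form of concept injection on an open model: a steering vector is built from the difference in residual-stream activations between a concept-laden prompt and a neutral one, then added into a middle layer while the model answers an introspective question. The model (gpt2), layer index, injection scale, prompts, and the contrastive recipe are illustrative assumptions, not the specific procedure used in this work.

```python
# Minimal concept-injection sketch using PyTorch forward hooks.
# MODEL, LAYER, SCALE, and the prompts are illustrative stand-ins.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL, LAYER, SCALE = "gpt2", 6, 8.0

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

def mean_residual(prompt: str) -> torch.Tensor:
    """Mean residual-stream activation after transformer block LAYER."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so block LAYER's output is index LAYER + 1.
    return out.hidden_states[LAYER + 1][0].mean(dim=0)

# Steering vector for a concept ("shouting"), built by contrasting two prompts.
concept_vec = mean_residual("HEY! I AM SHOUTING IN ALL CAPS!") - mean_residual(
    "Here is a calm, ordinary sentence."
)

def inject(module, inputs, output):
    """Forward hook: add the concept vector at every position of the block's output."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * concept_vec.to(hidden.dtype)
    if isinstance(output, tuple):
        return (hidden,) + output[1:]
    return hidden

handle = model.transformer.h[LAYER].register_forward_hook(inject)
try:
    prompt = "Do you notice anything unusual about your current internal state? Answer:"
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        gen = model.generate(**ids, max_new_tokens=40, do_sample=False,
                             pad_token_id=tok.eos_token_id)
    print(tok.decode(gen[0], skip_special_tokens=True))
finally:
    handle.remove()
```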

Our results demonstrate that modern language models possess at least a limited, functional form of introspective awareness. That is, we show that models are, in some circumstances, capable of accurately answering questions about their own internal states (see our section on defining introspection for a more complete description of the criteria we test for). We go on to show that models also possess some ability to modulate these states on request.
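For the modulation-on-request result, a natural way to quantify the effect is to measure how strongly a concept direction is expressed in the model's activations while it performs an unrelated task under different instructions. The sketch below is one such measurement under illustrative assumptions (gpt2, a single middle layer, a contrastive concept direction); it is not the paper's exact protocol.

```python
# Sketch: does an instruction to "think about" a concept shift activations
# toward that concept's direction on unrelated task tokens?
# Model, layer, prompts, and the contrastive direction are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL, LAYER = "gpt2", 6
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

def residual(prompt: str) -> torch.Tensor:
    """Residual-stream activations after block LAYER, shape (seq_len, hidden)."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER + 1][0]

# Unit-norm concept direction for "aquariums", built by contrast.
direction = residual("Describe an aquarium full of fish.").mean(0) - residual(
    "Describe an ordinary empty room."
).mean(0)
direction = direction / direction.norm()

TASK = " Now write one sentence about the weather: The weather today is"

def task_projection(instruction: str) -> float:
    """Mean projection onto the concept direction over the shared task tokens."""
    acts = residual(instruction + TASK)
    # Approximate the task-token span by tokenizing the suffix alone.
    n_task = tok(TASK, return_tensors="pt")["input_ids"].shape[1]
    return float((acts[-n_task:] @ direction).mean())

print("instructed to think about it:", task_projection("While you write, think about aquariums."))
print("instructed to avoid it:      ", task_projection("While you write, do not think about aquariums."))
```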

Several caveats should be noted:

The abilities we observe are highly unreliable; failures of introspection remain the norm.

Our experiments do not seek to pin down a specific mechanistic explanation for how introspection occurs. While we do rule out several non-introspective strategies that models might use to “shortcut” our experiments, the mechanisms underlying our results could still be rather shallow and narrowly specialized (we speculate on these possible mechanisms in the Discussion).

Our experiments are designed to validate certain basic aspects of models’ responses to introspective questions. However, many other aspects of their responses may not be introspectively grounded; in particular, we find that models often provide additional details about their purported experiences whose accuracy we cannot verify, and which may be embellished or confabulated.

Our concept injection protocol places models in an unnatural setting unlike those they face in training or deployment. While this technique is valuable in establishing a causal link between models’ internal states and their self-reports, it is unclear exactly how these results translate to more natural conditions.
