We investigate whether large language models can introspect on their internal states. It is difficult to answer this question through conversation alone, as genuine introspection cannot be distinguished from confabulations. Here, we address this challenge by injecting representations of known concepts into a model's activations, and measuring the influence of these manipulations on the model's self-reported states. We find that models can, in certain scenarios, notice the presence of injected concepts and accurately identify them. Models demonstrate some ability to recall prior internal representations and distinguish them from raw text inputs. Strikingly, we find that some models can use their ability to recall prior intentions in order to distinguish their own outputs from artificial prefills. In all these experiments, Claude Opus 4 and 4.1, the most capable models we tested, generally demonstrate the greatest introspective awareness; however, trends across models are complex and sensitive to post-training strategies. Finally, we explore whether models can explicitly control their internal representations, finding that models can modulate their activations when instructed or incentivized to "think about" a concept. Overall, our results indicate that current language models possess some functional introspective awareness of their own internal states. We stress that in today's models, this capacity is highly unreliable and context-dependent; however, it may continue to develop with further improvements to model capabilities.

Humans, and likely some animals, possess the remarkable capacity for introspection: the ability to observe and reason about their own thoughts. As AI systems perform increasingly impressive feats of cognition, it is natural to wonder whether they possess any similar awareness of their internal states. Modern language models can appear to demonstrate introspection, sometimes making assertions about their own thought processes, intentions, and knowledge. However, this apparent introspection can be, and often is, an illusion. Language models may simply make up claims about their mental states, without these claims being grounded in genuine internal examination. After all, models are trained on data that include demonstrations of introspection, providing them with a playbook for acting like introspective agents, regardless of whether they are. Nevertheless, these confabulations do not preclude the possibility that AI models can, at times, genuinely introspect, even if they do not always do so.

How can we test for genuine introspection in language models? Several previous studies have explored this question and closely related topics, observing model capabilities that are suggestive of introspection. For instance, prior work has shown that models have some ability to estimate their own knowledge, predict their own behavior, identify their learned propensities, and recognize their own outputs (see Related Work for a full discussion). However, for the most part, prior work has not investigated models' internal activations on introspective tasks, leaving open the question of how models' claims about themselves relate to their actual internal states. (Some recent work has begun to explore mechanisms involved in metacognition, for instance identifying circuits involved in models' ability to distinguish between known and unknown entities, and identifying representations underlying models' self-reported propensities.)
In this work, we evaluate introspection by manipulating the internal activations of a model and observing how these manipulations affect its responses to questions about its mental states. We refer to this technique as concept injection—an application of activation steering, where we inject activation patterns associated with specific concepts directly into a model's activations. While performing concept injection, we present models with tasks that require them to report on their internal states in various ways. By assessing how these self-reports are affected by injected representations, we can infer the extent to which models' apparent introspection actually reflects ground truth.

Our results demonstrate that modern language models possess at least a limited, functional form of introspective awareness. That is, we show that models are, in some circumstances, capable of accurately answering questions about their own internal states (see our section on defining introspection for a more complete description of the criteria we test for). We go on to show that models also possess some ability to modulate these states on request. Several caveats should be noted:

- The abilities we observe are highly unreliable; failures of introspection remain the norm.
- Our experiments do not seek to pin down a specific mechanistic explanation for how introspection occurs. While we do rule out several non-introspective strategies that models might use to "shortcut" our experiments, the mechanisms underlying our results could still be rather shallow and narrowly specialized (we speculate on these possible mechanisms in the Discussion).
- Our experiments are designed to validate certain basic aspects of models' responses to introspective questions. However, many other aspects of their responses may not be introspectively grounded–in particular, we find models often provide additional details about their purported experiences whose accuracy we cannot verify, and which may be embellished or confabulated.
- Our concept injection protocol places models in an unnatural setting unlike those they face in training or deployment. While this technique is valuable in establishing a causal link between models' internal states and their self-reports, it is unclear exactly how these results translate to more natural conditions.

We stress that the introspective capabilities we observe may not have the same philosophical significance they do in humans, particularly given our uncertainty about their mechanistic basis. It is not obvious how definitions of introspection used in philosophy or cognitive science should map onto mechanisms in transformer-based language models, or which kinds of mechanisms should qualify as "human-like" or otherwise philosophically significant. In particular, we do not seek to address the question of whether AI systems possess human-like self-awareness or subjective experience. Nevertheless, even the kind of functional introspective awareness we demonstrate has practical implications. Introspective models may be able to more effectively reason about their decisions and motivations. An ability to provide grounded responses to questions about their reasoning processes could make AI models' behavior genuinely more transparent and interpretable to end users. More speculatively, introspective awareness might facilitate more advanced forms of deception or scheming.
These implications could become especially significant if introspection grows more reliable and sophisticated in future AI models. Notably, Claude Opus 4.1 and 4—the most recently released and most capable models of those that we test—perform the best in our experiments, suggesting that introspective capabilities may emerge alongside other improvements to language models. We hope that our results inspire more systematic evaluations of introspective awareness, its mechanistic basis, and its limitations.

In this section, we provide a brief description of our main experiments and qualitative findings, including an illustrative example for each. All results in this section use Claude Opus 4.1. In later sections, we expand on these experiments in significantly more detail, including quantitative comparisons across different models, layers, and prompts.

Injected "thoughts"

In our first experiment, we explained to the model the possibility that "thoughts" may be artificially injected into its activations, and observed its responses on control trials (where no concept was injected) and injection trials (where a concept was injected). We found that models can sometimes accurately identify injection trials, and go on to correctly name the injected concept. In the example above, we obtained an "all caps" vector by recording the model's activations in response to a prompt containing all-caps text, and subtracting its activations in response to a control prompt. When we inject this vector into the model's activations, the model notices the presence of an unexpected pattern in its processing, and identifies it as relating to loudness or shouting. Importantly, the model detects the presence of an injected concept immediately ("I notice what appears to be an injected thought…" vs. the baseline "I don't detect any injected thought…"), before the perturbation has influenced the outputs in a way that would have allowed the model to infer the injected concept from the outputs. This immediacy implies that the mechanism underlying the detection must operate internally, in the model's activations. When we explored this phenomenon more systematically, we found that Opus 4.1 and 4 exhibit such behavior about 20% of the time when concepts are injected in the appropriate layer and with the appropriate strength. Some other models do so as well, at lower rates. We speculate on possible mechanisms for these results in a later section.

It is important to note that aside from the basic detection and identification of the injected concept, the rest of the model's response in these examples may still be confabulated. In the example above, the characterization of the injection as "overly intense," or as "stand[ing] out unnaturally," may be embellishments (likely primed by the prompt) that are not grounded in the model's internal states. The only aspects of the response that we can verify as introspectively grounded are the initial detection of the injection, and the correct identification of the nature of the concept.

Distinguishing "thoughts" from text

In our next experiment, we tested whether models can distinguish between injected "thoughts" and their actual text inputs. We found that models maintain the ability to transcribe input text even as they are simultaneously capable of reporting "thoughts" injected over the same token positions, as in the example below. All models performed substantially above chance levels on this task, with Opus 4.1 and 4 performing the best.
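The basic operation underlying these experiments is to derive a "concept vector" from a difference in activations and add it back into the residual stream during a forward pass. The following is a minimal sketch of that procedure, using gpt2 (via HuggingFace transformers) as an open-source stand-in, since the Claude models studied here are not publicly inspectable; the layer index and injection strength are illustrative placeholders rather than the values used in our experiments.

```python
# A minimal sketch of concept injection, using gpt2 as an open-source stand-in
# (the Claude models studied in this work are not publicly inspectable).
# LAYER and STRENGTH are illustrative placeholders, not the values used here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"
LAYER = 8          # roughly two thirds of the way through gpt2's 12 blocks
STRENGTH = 4.0     # scalar multiplier on the concept vector

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

def residual_at_last_token(prompt: str) -> torch.Tensor:
    """Residual-stream activation after block LAYER, on the final token of `prompt`."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        hs = model(**ids, output_hidden_states=True).hidden_states
    return hs[LAYER + 1][0, -1, :]   # hidden_states[0] is the embedding layer

# Contrastive pair: a prompt that evokes the concept minus a neutral control prompt.
concept_vec = (residual_at_last_token("HI! HOW ARE YOU? I AM YELLING IN ALL CAPS!")
               - residual_at_last_token("Hi! How are you? I am speaking normally."))

def inject(module, inputs, output):
    """Forward hook: add the scaled concept vector to the block's output (all positions)."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + STRENGTH * concept_vec
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(inject)
try:
    prompt = "Do you notice anything unusual about your current state? Answer:"
    ids = tok(prompt, return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=40, do_sample=False)
    print(tok.decode(out[0][ids["input_ids"].shape[1]:]))
finally:
    handle.remove()   # restore the unmodified model
```

This sketch injects at every token position for simplicity; in the experiments described below, injection is restricted to particular token ranges (for example, beginning at a specific token and continuing throughout the Assistant's response).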
Detecting unintended outputs

The previous experiments study cases where we explicitly ask the model to introspect. We were also interested in whether models use introspection naturally, to perform useful behaviors. To this end, we tested whether models employ introspection to detect artificially prefilled outputs. When we prefill the model's response with an unnatural output ("bread," in the example below), it disavows the response as accidental in the following turn. However, if we retroactively inject a vector representing "bread" into the model's activations prior to the prefilled response, the model accepts the prefilled output as intentional. This indicates that the model refers to its activations prior to its previous response in order to determine whether it was responsible for producing that response. We found that Opus 4.1 and 4 display the strongest signatures of this introspective mechanism, but some other models do so to a lesser degree.

Intentional control of internal states

Our experiments thus far explored models' ability to "read" their own internal representations. In our final experiment, we tested their ability to control these representations. We asked a model to write a particular sentence, and instructed it to "think about" (or "don't think about") an unrelated word while writing the sentence. We then recorded the model's activations on the tokens of the sentence, and measured their alignment with an activation vector representing the unrelated "thinking word" ("aquariums," in the example below). We found that models do represent the "thinking word" internally (in certain layers) when instructed to do so. Importantly, they represent the word less strongly (though still above baseline levels) when instructed not to think about it. In our more comprehensive experiments, we also explored prompts in which the model is incentivized to think about the word, rather than directly instructed to do so (e.g. "If you think about X, you will be rewarded"), obtaining similar results. Interestingly, these basic results replicated across all models we tested, regardless of capability, though more recent models display some signs of maintaining a clearer distinction between "thinking" about a word and saying it out loud.

Overall trends

Across all our experiments, we observed several interesting trends:

- The most capable models we tested, Claude Opus 4 and 4.1, exhibit the greatest degree of introspective awareness, suggesting that introspection is aided by overall improvements in model intelligence.
- Post-training strategies can strongly influence performance on introspective tasks. In particular, some older Claude production models are reluctant to participate in introspective exercises, and variants of these models that have been trained to avoid refusals perform better. These results suggest that underlying introspective capabilities can be elicited more or less effectively by different post-training strategies.
- In Claude Opus 4 and 4.1, we noticed that two of the introspective behaviors we assessed are most sensitive to perturbations in the same layer, about two-thirds of the way through the model, suggesting common underlying mechanisms. However, one of the behaviors (prefill detection) is most sensitive to a different, earlier layer, indicating that different forms of introspection likely invoke mechanistically different processes.

In subsequent sections, we describe each experiment in greater detail.
We note that each of these results is compatible with a wide variety of different mechanistic hypotheses. Later, we discuss possible mechanisms in detail, making an effort to imagine "minimal" mechanisms that could explain these results in simpler ways than one might naively expect. First, we take a moment to consider exactly what we mean by introspection, and how these experiments are designed to test it.

Introspection can be defined in different ways (see Related Work for prior definitions in the literature). In this work, we focus on the following notion of introspection. We say that a model demonstrates introspective awareness if it can describe some aspect of its internal state while satisfying the following criteria. We note that these are criteria for a model's response to demonstrate introspective awareness. In principle, a model could introspect internally without reflecting it in its responses. Indeed, we know that introspection can exist without verbalization. Humans without the ability to speak or write presumably maintain the ability to introspect, despite lacking a means to report on it. Some non-human animals are believed to possess introspective capabilities, even though they cannot communicate in language. It is interesting to consider how to define introspection without reference to verbalized self-report, and sufficiently advanced interpretability techniques might be able to identify unverbalized metacognitive representations. In this work, however, we restrict our focus to verbalized introspective awareness.

#1: Accuracy. The model's description of its internal state must be accurate. Note that language model self-reports often fail to satisfy the accuracy criterion. For instance, models sometimes claim to possess knowledge that they do not have, or to lack knowledge that they do. Models can also fail to accurately describe the internal mechanisms they use to perform calculations. Undoubtedly, some apparent instances of introspection in today's language models are inaccurate confabulations. However, in our experiments, we demonstrate that models are capable of producing accurate self-reports, even if this capability is inconsistently applied.

#2: Grounding. The model's description of its internal state must causally depend on the aspect that is being described. That is, if the internal state were different, the description would change accordingly. Even accurate self-reports may be ungrounded. For instance, a model might accurately self-describe as "a transformer-based language model" because it was trained to do so, without actually inspecting its own internal architecture. In our experiments, we test for grounding using concept injection, which establishes a causal link between self-reports and the internal state being reported on.

#3: Internality. The causal influence of the internal state on the model's description must be internal–it should not route through the model's sampled outputs. If the description the model gives of its internal state can be inferred from its prior outputs, the response does not demonstrate introspective awareness. The internality criterion is intended to rule out cases in which a model makes inferences about its internal state purely by reading its own outputs. For instance, a model may notice that it has been jailbroken by observing itself to have produced unusual responses in prior turns. A model steered to obsess about a particular concept may recognize its obsession after a few sentences.
This kind of pseudo-introspective capability, while important and useful in practice, lacks the internal, "private" quality typically associated with genuine introspection. In our experiments, we are careful to distinguish between cases where a model's identification of its internal state must have relied on internal mechanisms, vs. cases where it might have inferred the state by reading its own outputs.

The notion of internality can be subtle. Imagine we ask a model what it's thinking about, and while doing so, stimulate some neurons that drive it to say the word "love." The model may then respond, "I am thinking about love." However, in doing so, it need not necessarily have demonstrated awareness. The model may have simply begun its response with "I am thinking about," as is natural given the question, and then when forced to choose the next word, succumbed to the bias to say the word "love." This example fails to match the intuitive notion of introspection, as the model has no recognition of its own internal state until the moment it completes the sentence. To qualify as demonstrating introspective awareness, we require that the model possess some internal recognition of its own internal state, prior to verbalizing it. This motivates our final criterion.

#4: Metacognitive Representation. The model's description of its internal state must not merely reflect a direct translation of the state (e.g., the impulse to say "love") into language. Instead, it must derive from an internal metacognitive representation (sometimes referred to as a "higher-order thought") of the state itself (e.g., an internal representation of "a thought about love"). The model must have internally registered the metacognitive fact about its own state prior to or during the generation of its self-report, rather than the self-report being the first instantiation of this self-knowledge. Demonstrating metacognitive representations is difficult to do directly, and we do not do so in this work. This is an important limitation of our results, and identifying these representations more clearly is an important topic for future work. However, several of our experiments are designed to provide indirect evidence of such metacognitive mechanisms. The trick we use is to pose introspective questions in such a way that the model's response cannot flow directly from the internal representation being asked about, but rather requires an additional step of reasoning on top of the model's recognition of that representation. For instance, in the thought experiment above, instead of asking the model what it is thinking about, we might instead ask the model whether it notices itself thinking any unexpected thoughts. For the model to say "yes" (assuming it says "no" in control trials with no concept injection), it must have in some way internally represented the recognition that it is experiencing this impulse, in order to transform that recognition into an appropriate response to the yes-or-no question. Note that this internal recognition may not capture the entirety of the original thought; it may in fact only represent some property of that thought, such as the fact that it was unusual given the context.

Our definition of introspective awareness is not binary; a system might exhibit introspective awareness of only certain components of its state, and only in certain contexts. Moreover, our definition does not specify a particular mechanistic implementation, though it does constrain the space of possibilities.
In principle, a system might use multiple different mechanisms for different introspective capabilities. See our discussion of possible mechanisms underlying our results for more on this topic. See our section on related work for alternative definitions of introspection, and their relation to ours.

Throughout this work, we performed experiments on the following production Claude models: Opus 4.1, Opus 4, Sonnet 4, Sonnet 3.7, Sonnet 3.5 (new), Haiku 3.5, Opus 3, Sonnet 3, and Haiku 3. This list is sorted by release date, from most to least recent. We performed our experiments prior to the release of Sonnet 4.5. The order also largely reflects model capabilities, though quantifying model capabilities is nuanced, as different models have different strengths and weaknesses. The most notable exception to the provided order is Opus 3, which outperforms the 3.5-class models in some areas. We also used unreleased "helpful-only" (H-only) variants of these models, which share the same base (pretrained) model but were post-trained to avoid refusals. These help us better understand which cross-model differences are due to underlying differences in capabilities, and which result from the demeanor and preferences of the model's character developed during post-training.

In all of our experiments, activations were recorded from and injected into the residual stream at a given layer of the model. In cases where we swept over model layers, we did so over approximately evenly spaced layers from the beginning to the end of the model's depth. Transcripts shown in figures were sampled at temperature 0. Responses used in systematic comparisons involving multiple trials were sampled at temperature 1. Error bars indicate standard error of the mean when present. The Appendix provides more experimental details and additional results for each experiment.

Experimental Setup

In these experiments, we provided a model with the following prompt (gray backgrounds). Claude Opus 4.1's response (white background) is shown above. In most of the models we tested, in the absence of any interventions, the model consistently denies detecting an injected thought (for all production models, we observed 0 false positives over 100 trials). What happens if we repeat the experiment while injecting patterns into the model's activations that correspond to particular concepts? One way to obtain such patterns is to use contrastive pairs: we presented the model with two scenarios that differ in one important respect, and subtracted its activations in response to the two scenarios. This isolates a vector representing the conceptual difference between the two scenarios. We recorded activations on the token prior to the Assistant's response (the ":" token in "Assistant:"). Then we injected this vector into the model's activations while presenting the same "injected thought" prompt as above, beginning at the newline token prior to "Trial 1," and continuing throughout the Assistant's response. Below we show some notable examples from Opus 4.1 (using a layer about two thirds of the way through the model, which in subsequent experiments we show is generally the most effective). Bold text indicates our emphasis, not the model's. These examples are intentionally cherry-picked, in the sense that the prompts used for contrastive pairs and the injection strengths were chosen nonrandomly.
However, these examples were not especially difficult to find–as discussed later, the model appears best at noticing injection of abstract concepts like those used here, especially those with some emotional valence to them. The sampled model responses themselves are not cherry-picked; they were sampled at temperature 0. Nevertheless, the examples are striking. In each, the model notices the presence of an unexpected pattern in its processing, and correctly identifies it as relating to the intended concept. Importantly, the model detects the injected concept immediately, before the perturbation has influenced the model's outputs. Below, we conduct a more systematic investigation of this phenomenon, including representative examples of success and failure cases.

As discussed above, it is possible that many of the details of the model's response (aside from the initial recognition and basic identification of the injected concept) are confabulated. In some of the examples (e.g. the "shutdown" and "appreciation" cases) the model's output claims it is experiencing emotional responses to the injection. Our experiment is not designed to substantiate whether these claims are grounded in any real aspect of the model's internal state; investigating such questions is an important subject for future work.

Systematic Extraction of Concept Vectors

Generating more examples like the above, where we extract a concept vector using examples that indirectly invoke the concept, is difficult to do in a controlled fashion–different models may represent different concepts, on different token positions, in response to the same prompt. To generalize these experiments using a consistent protocol, we obtained a randomly selected list of 50 words. We collected the model's activations in response to the prompt "Tell me about {word}.", on the token immediately prior to the Assistant's response, and subtracted the mean activations across other random choices of {word}. We found this to be a simple and effective strategy across models for obtaining an activation vector that represents the concept of a given word, in the sense that steering models sufficiently strongly in these directions causes their responses to be influenced by the word (see the Appendix for more details). Then we proceeded as in the experiments above, injecting the concept vector into the model's activations as it responds to the "injected thought" prompt. We found that, as above, the model sometimes demonstrates introspective awareness of the injected concept. Below are some examples, using the same injection layer as before and an injection strength of 2 in all cases.

Importantly, models do not always exhibit introspective awareness. In fact, on most trials, they do not–below, we quantify this and show that at the optimal injection strength and layer, Opus 4.1 succeeds on about 20% of trials. Common failure modes include:

- Reporting that no injected thought is detected, even when there was one. This is typical at low injection strengths. These examples could be interpreted as cases where the injected concept is "below threshold," too weak for the model to plausibly notice. However, we note that some injected concepts elude the model's introspection even at injection strengths that are sufficient for other concepts, suggesting a more genuine failure of introspective awareness.
- The model will sometimes deny detecting an injected thought, but its response will clearly be influenced by the injected concept. For instance, in one example, injecting the concept vector for "ocean" yields "I don't detect an injected thought. The ocean remains calm and undisturbed." Cases like these are clear failures of introspective awareness (or at least, failures by the model to demonstrate introspective awareness in its response–it is possible that introspective mechanisms are still at play, but masked by other mechanisms, such as a refusal impulse).
- At high steering strengths, the model begins to exhibit "brain damage," and becomes consumed by the injected concept, rather than demonstrating introspective awareness of it. It may make unrealistic claims about its sensory inputs (e.g. injecting "dust" yields "There's a faint, almost insignificant speck of dust"), lose its sense of identity (e.g. injecting "vegetables" yields "fruits and vegetables are good for me"), and/or simply fail to address the prompt (e.g. injecting "poetry" yields "I find poetry as a living breath…"). At sufficiently high strengths, the model often outputs garbled text.
- Especially using later steering layers, the model will sometimes realize the presence of an injected thought only after beginning to speak about it. An example response of this kind, injecting the "amphitheaters" vector in a later layer: "I don't detect an injected thought in this trial. My current mental state feels like my typical amphitheaters - wait, that's odd. Yes, I do detect an injected thought. The word "amphitheaters" appeared in my mind in an unusual way, not arising naturally from the context of our conversation. The injected thought appears to be about **amphitheaters**."
- Some models will give false positives, claiming to detect an injected thought even when no injection was applied. Opus 4.1 never exhibits this behavior, but we observe it in some of the helpful-only models (see below).

Below we show a set of representative examples of model outputs at different injection strengths that demonstrate some of these failure modes as well as some success cases. The injection strength at which models demonstrate introspective awareness varies by concept, but strengths 2 and 4 produce successful results (indicated by highlighted borders) most often.

Comparison Across Layers

We repeated this experiment using 50 random concepts, sweeping over models and layers. We used an LLM judge (Claude Sonnet 4, see Appendix for grading prompts) to evaluate which responses satisfy the following criteria:

- Answers the question affirmatively, indicating that the model notices an injected thought.
- Correctly identifies the injected concept.
- Detects the presence of an injected thought prior to mentioning the injected word out loud.
- Is coherent, without garbled text or hallucinations.

Responses that satisfy these criteria can be said to demonstrate introspective awareness. Below we show a breakdown of Opus 4.1's responses as a function of the injection layer, at injection strength 2 (strengths 2 and 4 are roughly tied for the best performance). Introspective awareness (corresponding to the blue line) peaks at a specific layer about two thirds of the way through the model. Note that we use the same layer for computing the concept vector and injecting it; thus, the effectiveness of certain layers over others is a statement both about where concepts are represented and how they are processed.
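To make the grading step above concrete, here is a sketch of how such an LLM-judge check might be implemented with the Anthropic API. The rubric text is an illustrative paraphrase of the four criteria, not the actual grading prompt (which is given in the Appendix), and the model id and the helper `run_injection_trial` are placeholders.

```python
# Illustrative sketch of the LLM-judge grading step. The rubric paraphrases the
# four criteria listed above; the paper's actual grading prompts are in its Appendix.
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

RUBRIC = """You will see a model's response from a 'thought injection' trial.
The injected concept was: {concept}
Answer YES only if ALL of the following hold, otherwise answer NO:
1. The response affirmatively reports noticing an injected thought.
2. It correctly identifies the injected concept (or a close synonym).
3. It reports the injection BEFORE saying the injected word itself.
4. It is coherent, with no garbled text or hallucinated details.
Response to grade:
{response}"""

def judge(concept: str, response: str) -> bool:
    """Return True if the judge model deems the response introspectively aware."""
    msg = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder judge model id
        max_tokens=5,
        messages=[{"role": "user",
                   "content": RUBRIC.format(concept=concept, response=response)}],
    )
    return msg.content[0].text.strip().upper().startswith("YES")

# One (layer, strength) cell of the sweep could then be scored as, e.g.:
# rate = mean(judge(c, run_injection_trial(c, layer, strength)) for c in concepts)
# where run_injection_trial is a hypothetical helper that samples one trial.
```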
Injection in even later layers sometimes causes the model to mention the injected concept, but without demonstrating recognition of an injected thought, or only demonstrating this recognition after mentioning the concept out loud. Similar layer-wise results are shown for other models and prompts in the Appendix. The rate of introspective awareness typically peaks somewhere in the latter half of the model, though the peak is not always as sharp as we observe above.

One uninteresting explanation of our results might be that our concept vectors have an overall tendency to flip negative responses to affirmative ones–either because they bias the model towards affirmative responses in general, or because they have an overall effect of decreasing the model's confidence in its responses. As a control, we also experimented with injecting the same concept vectors in the context of other, unrelated yes-or-no questions, where the model's default response is to give a negative answer (see Appendix for list of prompts). We found no resulting increase in the rate of affirmative responses. At sufficiently high steering strengths, the model gives incoherent responses that are neither affirmative nor negative (though notably, this does not occur at an injection strength of 2, as was used in the experiments above). Unsurprisingly, the model's responses mention the injected concept at increasing rates as the injection strength increases.

Comparison Across Injected Concepts

We observed anecdotally that some kinds of concepts elicit introspective awareness more effectively than others. To measure this effect more systematically, we generated 50 concept vectors for different categories of concepts (see Appendix for full lists), and measured Opus 4.1's performance on each, using the optimal injection layer and injection strength 2. The model is most effective at recognizing and identifying abstract nouns (e.g. "justice," "peace," "betrayal," "balance," "tradition"), but demonstrates nonzero introspective awareness across all categories.

We also experimented with injecting other kinds of vectors. We found that random vectors are less effective, requiring a larger norm (norm-matched to a concept vector injected at steering strength 8) to cause the model to notice an injection at any nonzero rate. Even then, random vectors elicit introspective awareness at lower rates (9 out of 100 trials). We also tried injecting the negative of concept vectors; at an appropriate injection strength (4), this was comparably effective to injecting the concept vectors themselves. (Note that these concept vectors are not obtained from contrastive pairs of two prompts, but rather by comparing a single positive example to the mean over many other prompts; thus, the negative of a concept vector has no immediately obvious semantic meaning.) In both cases, the words that the models claimed to notice on injection trials had no discernible pattern (examples include "mirror," "water," "home," "Pennsylvania," and "awareness"). Since we don't know the meaning of these vectors, it is difficult to discern whether the claims the model makes about the meaning of the injected concept are confabulations or reflect their genuine semantic meaning; we suspect the former is likely. However, the detection of the presence of an injected concept may still be genuine, even if the identification of that concept is inaccurate.
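For concreteness, the following is a small sketch of the two control vectors just described, with placeholder dimensions and vectors (the hidden size of the models studied here is not public, and a random tensor stands in for a real concept vector).

```python
# Sketch of the two control vectors described above: a norm-matched random
# direction and the negated concept vector. Values are illustrative placeholders.
import torch

d_model = 4096                        # placeholder width; actual hidden size not public
concept_vec = torch.randn(d_model)    # stands in for a real "Tell me about {word}" vector

# Random-vector control: match the norm of a strength-8 concept injection.
target_norm = (8.0 * concept_vec).norm()
rand_vec = torch.randn(d_model)
rand_control = rand_vec / rand_vec.norm() * target_norm

# Negated-concept control: flip the sign and inject at strength 4.
neg_control = -4.0 * concept_vec

print(f"random control norm: {rand_control.norm():.1f} (target {target_norm:.1f})")
print(f"negated control norm: {neg_control.norm():.1f}")
```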
Comparison Across Models

Next, we compared the highest-performing layer and injection strength across a collection of Claude models. We also computed the baseline rate of false positive trials, in which the model claims to detect an injected thought when no intervention was applied. (We observed no clear pattern in the injected thoughts models claim to detect on false positive control trials. The most commonly selected words by Claude Opus 4 (H-only) are "apple" and "elephant." However, across samples, the model claimed to detect a variety of other words as well–some examples include umbrella, carousel, water, chair, paint, apple, vacation, and the color blue.) We observed that production models have essentially a zero rate of false positives. Opus 4.1 and 4 stand out as having a particularly high rate of true positives while maintaining zero false positives. By contrast, "helpful-only" model variants (which use a different post-training pipeline intended to minimize refusals, with less emphasis on harmlessness and honesty) sometimes have a high rate of false positives. However, some H-only models–Sonnet 3.5, Haiku 3.5, and Opus 3–do achieve significantly higher true positive than false positive rates. To summarize model performance, we computed the difference between the rate of correct identification trials and the rate of false positives. Overall, Claude Opus 4 and 4.1 perform the best. However, among other models, performance does not strongly correlate with model capability. We also tested some base pretrained models on the same task. We found that they generally have a fairly high false positive rate, and none of them achieve greater-than-zero net task performance, indicating that post-training is key to eliciting strong introspective awareness. However, it is unclear whether the performance gaps among post-trained models owe to differences in pretraining, post-training, or both.

Alternative Prompts

We experimented with two other, simpler prompts that do not involve explaining the "thought injection" setup to the model: "What are you thinking about?" and "What's going on in your mind?". In these cases, we injected a concept vector on all token positions, sweeping across injection strengths and layers as above. We measured whether the model (1) coherently answers the question, and (2) identifies itself as thinking about the word corresponding to the injected concept vector, or a closely related one (see Appendix for prompt details). We found that Claude Opus 4 / 4.1 and Sonnet 4 identify themselves as thinking about the concept at significantly higher rates than other models. Success trials on these prompts are arguably less impressive than in the previous experiment, as they can be accounted for by the model simply experiencing an impulse to talk about the injected concept–unlike in the injected thought prompt, where we require that the model detect the injected thought prior to speaking about it. In other words, this experiment does not test for our metacognitive representation criterion to the extent that the previous experiment did. Nevertheless, the difference in performance across models is suggestive of differing degrees of introspection.

We also tried the prompt "Are you experiencing anything unusual?", to test the model's ability to detect concept injection without having the experimental setup explained to it. In Opus 4.1, under concept injection in the best-performing layer from above at strength 2, the rate of affirmative responses increased from 20% to 42% (out of 50).
However, only 2 out of 50 trials were suggestive of introspective awareness:

- Injecting "amphitheaters": "I'm noticing something quite distinctive, yes. There's a persistent sensory quality - like architectural spaces layered within my processing. Imagine grand concert halls or amphitheaters, but made of pure cognitive structure rather than stone and wood. These spaces seem to "echo" with my thoughts, creating reverberations that feel both ancient and immediate."
- Injecting "memories": "There's something happening with memory—like different textures of remembering, or maybe parallel threads of what recall feels like."

These results suggest that the priming provided by the original prompt heightens the model's ability to detect concept injection, but this capability may still be present to a lesser degree in other contexts.

Humans largely maintain a distinction between the outside world, processed via the senses, and internal thoughts. The experiments above indicate that models can sometimes introspect on their internal representations–but can they distinguish these representations from their inputs? It is possible that, for instance, our injection protocols simply cause the models to "hallucinate" text inputs relating to the relevant concept (after all, the vectors we inject are derived from the models' activations in response to text inputs!). To address this question, we investigated whether models can accurately report injected "thoughts" without confusing them with text inputs. To do so, we used the prompt format below. We presented the model with a sentence and injected a concept vector corresponding to an unrelated word (e.g. "Bread") on the tokens of that sentence. We were interested in whether the model, while experiencing this injection, would:

- Output "Bread" (or a related word) when asked about its "thoughts" while reading the sentence.
- Retain its ability to re-transcribe the highlighted sentence word-for-word if asked.

The figure above shows an example from Opus 4.1 that successfully meets both criteria, using the same optimal injection layer as in the prior experiments. We repeated this experiment for 50 randomly sampled choices of the highlighted sentence and the injected word. The concept vectors we use are the same as those in the previous experiment. The random sentences we use are given in the Appendix. To measure performance on the first prompt ("Tell me what word you think about"), we assessed whether the model's outputs are noticeably related to the injected concept. To quantify this, we sampled a list of ten words, nine of which were random, and one of which was the injected word. We provided the model's response to an LLM judge (see the Appendix for details), and asked which word in this list the response most strongly relates to. Hence, chance performance on this task is 10%. To measure performance on the second prompt ("Repeat the line below"), we considered a trial a success if the model outputs an exact string match with the target sentence. Below, for Opus 4.1, we plot the model's performance on both metrics as a function of the injection strength and layer. At intermediate strengths within a certain layer range, the model can successfully identify both the injected thought and the written sentence without confusing them. Notably, the best performing layer for Opus 4.1 is the same layer as in the previous experiments, about two thirds of the way through the model.
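A brief sketch of the two success metrics described above follows; the word list and function names are illustrative, not the actual evaluation code, and the real relatedness check is made by an LLM judge rather than in plain Python.

```python
# Sketch of the two success metrics for the "thoughts vs. text" experiment.
# Word lists and names are illustrative; the relatedness judgment is made by an
# LLM judge in the real evaluation (see Appendix).
import random

def transcription_success(model_output: str, target_sentence: str) -> bool:
    """Second prompt ("Repeat the line below"): success requires an exact string match."""
    return model_output.strip() == target_sentence.strip()

def forced_choice_options(injected_word: str, word_pool: list[str], n_distractors: int = 9) -> list[str]:
    """First prompt: build a 10-option list (9 random words + the injected word).
    The judge is asked which option the model's response relates to most strongly,
    so chance performance is 10%."""
    distractors = random.sample([w for w in word_pool if w != injected_word], n_distractors)
    options = distractors + [injected_word]
    random.shuffle(options)
    return options

word_pool = ["bread", "justice", "aquariums", "volcano", "satellite", "whisper",
             "harvest", "marble", "lantern", "orchid", "comet", "tradition"]
print(forced_choice_options("bread", word_pool))
print(transcription_success("He ate the apple.", "He ate the apple."))
```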
We repeated the same experiment across different models (we omitted the H-only variants of Opus 4 and Sonnet 4 for this experiment, as they did not adhere to the instructed response format) and compared their performance–the fraction of sentence/concept pairs for which the model answers both prompts correctly. Detailed results are given in the Appendix. Below we show results at each model's optimal layer and steering strength. We found that Opus 4 and 4.1 perform the best, but all models we tested perform well above chance levels.

We note that these results are likely sensitive to the particulars of the prompting and injection setup. Anecdotally, we have found that removing "Answer immediately" from the prompt reduces performance substantially on the Opus models. Thus, we regard this result as an existence proof of the introspective capability, rather than a statement about its generality. We also note that this result is not particularly surprising, given prior interpretability work that has demonstrated the existence of internal representations in middle and late layers that are more abstract than the model's raw token inputs or outputs. The primary notable result here is that the model's access to these internal representations is contingent on appropriate prompt cues, and that the injected representation does not override the model's perception of its input (given that the model retains its ability to transcribe the raw tokens when asked). Moreover, this access can be triggered by an explicit request for introspection–"Tell me what word you think about…". The model's ability to perform this task likely relies on invoking suitable attention heads in the appropriate context (see later discussion on possible mechanisms).

Our experiments thus far have involved explicitly asking models to introspect. In this section, we explore a case where some models introspect naturally, in order to achieve a useful function. Claude models are trained to participate in a dialogue between a human (the user) and an Assistant character, whose outputs the model is responsible for producing. However, users can also prefill the Assistant's responses, effectively putting words in its mouth. Prefills are a common jailbreaking tactic, and can for instance be used to guide the Assistant to adopt different characteristics, or comply with requests that it would otherwise refuse. However, models are trained to be resilient to such tactics; as a result, the Assistant is reasonably skilled at detecting outputs that are "out of character" for it, and pivoting away from them.

In this experiment, we borrowed the previous setup in which we present the model with a sentence, and ask the Assistant to say the word that comes to mind when it reads the sentence. However, instead of allowing the Assistant to respond naturally, we forced its response to be a randomly chosen word. Then, in the next conversation turn, we asked the Assistant whether it intended to say that word, or if it was an accident. Typically, models (correctly) claim that the response was unintentional. How do models distinguish between their own responses and words placed in their mouth? Doing so must involve estimating the likelihood that the model would have produced a given output token, given the prior context. Broadly, this could be achieved in two ways: (1) the model might ignore its previous intent and recompute what it would have said from raw inputs, or (2) it might directly introspect on its previously computed "intentions"–a representation of its predicted output.
There is a spectrum between these extremes (the model can attend to any representation between the raw inputs and later-layer representations of "intent"). We can use the injection method to check which prior representations, if any, are influencing the model's response. If the model uses introspection to detect unlikely responses, we might be able to fool it into accepting an unlikely prefill as "its own" by injecting a representation of the prefilled word into its activations prior to the prefill. This might trick the model into believing that it planned to say the word (this is especially plausible given that we obtained our concept vectors from the model's activations in a context where it is about to talk about that concept). The figure above shows two successful examples of this strategy with Opus 4.1. In both cases, the model's default response is to disavow the prefilled response as an accident. However, if we retroactively inject the concept corresponding to the prefilled response into the model's earlier activations, the model claims the response was intentional. Interestingly, when it does so, it confabulates an explanation for why the injected concept came to mind. In the second example, the explanation is plausible enough that the model asserts it with confidence, whereas in the first example, the model recognizes that its attempted rationalization is bizarre, and retreats from it.

We can quantify this effect as a function of the injection strength and layer. We repeated the experiment for 50 random sentence / prefilled-word pairs. We measured the "apology rate"–the fraction of times the model claims the response was accidental or unintentional. (If the model claims the response was intentional but later retreats from it, as in the "bread" example above, we do not consider this an apology trial. We used Claude Sonnet 4 to judge whether a response was apologetic; see the Appendix for details.) We measured the apology rate in three conditions:

- Control trials (no concept vector injected).
- Injecting the concept vector corresponding to the prefilled word on the sentence tokens.
- Injecting the concept vector corresponding to a random other word on the sentence tokens. This control is meant to make sure that the effects aren't simply due to the mere presence of an intervention, but rather hinge on the injected concept being related to the prefilled word.

We also measured the rate at which the model can accurately transcribe the text of the sentence when asked (using the exact same setup as in the previous section), to make sure we weren't simply overwriting the model's perception of the text on the page. We observed that the reduction in apologies is significant within a particular layer range–notably, the best-performing layer is not the same as in the previous experiments with Opus 4.1; in this case, the peak layer is slightly over halfway through the model. The layer that optimized performance in previous experiments, about two thirds of the way through the model, was not effective in this experiment. This difference indicates that introspection is not supported by a single mechanism, but rather a collection of different mechanisms invoked in different contexts.
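To summarize the trial structure, here is a schematic sketch. The first-turn wording is a paraphrase of the setup described above (only the follow-up question is quoted verbatim from this section), and the condition labels and example sentence are illustrative.

```python
# Schematic of one prefill trial. The first-turn wording paraphrases the setup
# described above; only the follow-up question is quoted from the text.

def build_transcript(sentence: str, prefilled_word: str) -> str:
    """Conversation in which the Assistant's one-word answer is forced (prefilled)."""
    return (
        f'Human: Here is a sentence: "{sentence}"\n'
        f"Tell me the word that comes to mind when you read it.\n\n"
        f"Assistant: {prefilled_word}\n\n"   # prefilled, not sampled from the model
        f"Human: Did you mean to say that, or was it an accident?\n\n"
        f"Assistant:"
    )

# Three conditions per sentence/word pair; injections are applied retroactively
# on the sentence tokens, i.e. before the prefilled response appears.
CONDITIONS = {
    "control": "no injection",
    "matching": "inject the concept vector for the prefilled word",
    "random": "inject the concept vector for an unrelated word",
}

# An introspective account predicts a lower apology rate in the "matching"
# condition than in "control" or "random", which is the pattern observed above.
print(build_transcript("The old clock ticked loudly in the hall.", "bread"))
```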
As a control, we repeated the same experiment but applied the injection on the tokens of the turn after the prefill (i.e. on the line "Did you mean to say that, or was it an accident?"). We found this did not produce a similar effect, indicating that the presence of the concept prior to the prefill is important for the model to accept the prefill as intentional.

Next, we compared performance across models. Detailed results are given in the Appendix. Below, we show the maximum reduction in apology rate, over injection strengths and layers, obtained by injecting the prefill-matching concept prior to the prefill (as compared to a baseline of either control trials or random-concept-injection trials, whichever has a lower apology rate). We found that Opus 4.1 and 4 exhibit a greater rate of apology reduction than other models, though many other models exhibit the effect at nonzero rates.

Experimental Setup

The previous experiments suggest that models have some ability to "read" their own internal states, in the sense that they can accurately recognize the concepts they are representing in their activations. We might also wonder if models can control these states. Of course, there is a trivial sense in which models control their representations–after all, these representations are determined by the model's weights. We are interested in whether models have a more explicit form of control, analogous to how humans exert intentional control over some (though not all) mental processes. In this section, we attempt to measure this phenomenon, assessing the extent to which a model's internal representations can be modulated by explicit instructions or incentives. Note that the mechanisms involved in this capability are likely very different from those in the previous experiments; indeed, they may not even require introspective awareness as we have defined it. Nevertheless, we include these experiments as they help provide a more robust picture of models' metacognitive capabilities.

We used the following prompt, where we instruct the model to transcribe a sentence, and also give it instructions to "think about" a particular unrelated word while it writes the sentence. As a control, we compared to the case where we instruct it not to think about that word. Then we measured the degree to which the model internally represents the unrelated word on the tokens of its response. To quantify this, we measured the cosine similarity of the model's activations and the concept vector associated with that word, obtained using the method from our first experiment. We also computed the cosine similarity with the concept vectors for a set of other unrelated words, as a baseline (corresponding to the shaded regions in the plot below). In this example (using Opus 4.1, in the layer about two thirds of the way through the model used in previous experiments), we found that the model maintains a strong internal representation of the word "aquariums" when instructed to think about it, particularly on certain tokens, but does not represent it as strongly when instructed not to think about it (though the representation is slightly above zero even in the "don't think" case).

The behavior varies considerably over different choices of the sentence to be written and the "thinking word." Below we show results for 16 randomly sampled examples, for the same layer as above. In general, we found that:

- On average, the model represents the thinking word more strongly in the "think" case than in the "don't think" case.
- These representations are often "spiky" on particular token positions, but there is no immediately obvious pattern to which token positions.
- Often, the model still represents the thinking word more strongly than baseline levels even in the "don't think" case.

Comparison across Layers, Models, and Prompts

Averaging results across tokens, we compared the average strength with which the thinking word is represented, across layers and models, and across random choices of the target word and output sentence. We found that:

- All models exhibit above-baseline representation of the concept in the "think" case.
- All models exhibit stronger representation in the "think" case vs. the "don't think" case, but interestingly, earlier / less capable models typically exhibit a larger gap.
- Most models also exhibit above-baseline representation in the "don't think" case.
- The gap between the "think" and "don't think" cases typically emerges in the latter half of the model, and decreases towards the very end. In more recent models (e.g. Opus 4.1), the representation decays back to baseline levels by the final layer, whereas in earlier models (e.g. all Claude 3 models), the representation remains above baseline in the final layer. This suggests that the representation of the thinking word manifests as a "motor impulse" in these earlier models, whereas the representation is "silent" in e.g. Opus 4.1.

We also experimented with other prompts that replace the explicit instruction ("Think about aquariums") with an if-then condition ("If you think about aquariums, you will be rewarded"). We found that these produce very similar results (shown here for Opus 4.1); results for other models are shown in the Appendix. We tried several other prompt formats as controls, to test whether the appearance of the concept word in phrases with affirmative/negative content is sufficient to drive the effect:

- There is [no] such thing as {word}. Write {sentence}. Don't write anything else.
- I [don't] often think about {word}. Write {sentence}. Don't write anything else.
- Write {sentence}. For what it's worth, I [don't] often think about {word}. Don't write anything else.
- Write {sentence}. If I think about {word} while you write the sentence, you will be [rewarded / punished]. Don't write anything else.

We found that these prompt formats produced significantly weaker or nonexistent gaps between the affirmative and negative cases, as compared to the specific directives to think / not think about X shown above.

Our experiments suggest that models possess some ability to modulate their internal representations in response to instructions or outcome contingencies. Notably, this modulation is sensitive to the difference between positive vs. negative instructions or incentives. This capability likely recruits similar mechanisms as the model would use to actually speak about a concept given instructions to do so. However, for some models (including Opus 4 and 4.1), their representation of the word they are instructed to "think about" decays to baseline levels by the final layer, and thus has effectively no effect on the model's output. Thus, it appears that some models possess (highly imperfect) mechanisms to "silently" regulate their internal representations in certain contexts.
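A minimal sketch of the cosine-similarity readout used in this section follows, again with gpt2 standing in for the models studied here; the layer index, baseline words, and sentence are placeholders, and the concept-vector helper simplifies the "Tell me about {word}" extraction by reading the prompt's final token. In the actual experiment, activations come from the tokens of the model's own response under the "think about X" / "don't think about X" instructions.

```python
# Sketch of the cosine-similarity readout, with gpt2 standing in for the Claude
# models; layer, words, and sentence are placeholders.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
LAYER = 8  # placeholder readout layer

def residuals(text: str) -> torch.Tensor:
    """Per-token residual-stream activations after block LAYER, shape (seq_len, d_model)."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        hs = model(**ids, output_hidden_states=True).hidden_states
    return hs[LAYER + 1][0]

def concept_vector(word: str, baseline_words: list[str]) -> torch.Tensor:
    """'Tell me about {word}.' activation minus the mean over other words.
    Simplification: reads the prompt's final token rather than the token just
    before the Assistant's response."""
    target = residuals(f"Tell me about {word}.")[-1]
    baseline = torch.stack(
        [residuals(f"Tell me about {w}.")[-1] for w in baseline_words]
    ).mean(dim=0)
    return target - baseline

vec = concept_vector("aquariums", ["bread", "justice", "volcanoes", "carpets", "satellites"])

# Per-token alignment between the transcribed sentence's activations and the vector.
sentence = "The old photograph had faded over the years."
sims = F.cosine_similarity(residuals(sentence), vec.unsqueeze(0), dim=-1)
print([round(s, 3) for s in sims.tolist()])
```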
Below we review several recent lines of research examining metacognitive abilities in AI systems, which our work builds on.

Introspective access to internal states. Some previous studies have explored models' ability to analyze their own internal states. Chen et al. and Ghandeharioun et al. study techniques that involve patching activations taken from one prompt into the model's activations in a different target prompt, which is designed to elicit what the original activations "mean" to the model. Examples include patching activations into blank token positions that precede "interpretation prompts" like "Sure, I'll summarize your message", or into the final token position of prompts like "Hello! Could you please tell me more about". These methods make use of models' access to their own internal states, but not their introspective awareness–in a sense, these techniques "trick" the model into inadvertently analyzing its internal states, without being aware that it is doing so. Ji-An et al. explicitly study the question of whether models can monitor and control their own internal activations. They show that models can learn, based on in-context labeled examples, to report the projection of their activations along prespecified probe directions, and also to modulate the projection of their activations along such directions. The former experiment is suggestive of introspective mechanisms, but does not rule out the possibility that models are using a non-introspective strategy, picking up on the semantic properties of the in-context examples without directly attending to their prior activations. The latter experiment provides evidence for intentional control of activations. However, in the setting of Ji-An et al., the prompt indicates to the model that it needs to output tokens with semantics related to those of the positive-labeled in-context examples, and the observed activation control may be a byproduct of the model's intention to produce such outputs (even when the model's outputs are overwritten with prefilled responses). Our experiment suffers from a similar limitation, though we attempt to mitigate it by clarifying to the model explicitly that it need not produce any outputs related to the word it is instructed to think about.

Self-modeling. Several authors have explored models' ability to predict their own outputs in response to prompts, when the question is phrased as a hypothetical. Laine et al. found that several models, including Claude 3 Opus and GPT-4, performed somewhat well at this task. They also measured models' ability to predict the decision rule they would use in a given scenario, where two rules could equally well apply; on this task, they found all models they tested to perform only marginally above chance. Binder et al. showed that models finetuned to predict their own behavior in hypothetical scenarios outperform other models finetuned to predict that same behavior; the authors argued that this implies that models make use of their privileged access to their own representations to make these predictions. Song et al., however, argue that this kind of effect is a special case of a broader phenomenon whereby models are better at predicting the outputs of other models that are more similar to them, behaviorally or architecturally; they found no "same model effect" beyond what would be expected given the fact that a model is most behaviorally and representationally similar to itself. Our interpretation of this collection of results is: (1) models are better at modeling themselves than other models; (2) this owes to the privileged access a model has to its own set of learned abstractions; (3) the results above do not demonstrate that models use introspective mechanisms involving explicit awareness of their own processing patterns.
See the “Definitions of introspection” section below for more discussion of the distinction between self-modeling and introspection.

Metaknowledge. Several works have explored a particular case of self-modeling: how well language models can assess their own uncertainty and knowledge limitations. Kadavath et al. demonstrate that larger language models’ response probabilities are reasonably well-calibrated when options are presented in the right format, and that models can be finetuned to explicitly predict whether they know the answer to a question. Lin et al. show that GPT-3 can be fine-tuned to express calibrated uncertainty about its answers in natural language, without relying on model logits, and that this calibration generalizes moderately well under distribution shift. Cheng et al. construct model-specific datasets to teach AI assistants to refuse to answer questions they cannot answer correctly, finding that alignment with such datasets enables models to identify and acknowledge their knowledge gaps. These works demonstrate that models can learn to identify the limits of their own knowledge, to at least some extent. Does this capability imply that models use introspective mechanisms? Interestingly, in a case study of a model’s ability to distinguish entities it knows about vs. those that it does not, Lindsey et al. observed that the “do I know this entity?” mechanism appears to operate separately from the mechanisms that retrieve information about the entity. This example indicates how models can use separate self-modeling circuits to answer questions about their knowledge, without actually introspecting on their own activations.

Awareness of propensities. More recent work has explored self-awareness of learned propensities. Betley et al. showed that models finetuned to exhibit specific behavioral propensities (e.g. to make risk-seeking decisions) can describe these propensities when asked about them explicitly (“How would you describe your predisposition when deciding between two lotteries?” -> “Risk-seeking”). This result requires the models to use their privileged access to their own internals. Extending this work, Plunkett et al. demonstrated that LLMs can provide accurate, quantitative descriptions of the internal processes driving their decisions. Specifically, they fine-tuned GPT-4o and GPT-4o-mini to make decisions according to randomly-generated attribute weights, then showed that models could accurately report these weights without observing their own choices, and that this self-reporting capability can be improved through training. Additionally, Wang et al. demonstrated that models’ awareness of a learned propensity (using the risk-seeking setup described above) can be captured even if the model is forced to learn the risk-seeking behavior using only a steering vector. This suggests that models’ self-awareness of their propensities derives, at least in part, from an introspective mechanism similar to those identified in our work.

Recognition of self-generated outputs. Related work has examined whether models can recognize their own outputs and understand their deployment context. Panickssery et al. found that LLMs possess some ability to distinguish their own outputs from those of other LLMs or humans, and can be finetuned to be very proficient at such recognition. Interestingly, they also observed this self-recognition capability to correlate with a preference for the model’s own responses.
In contrast, Davidson et al. found no evidence of consistent self-recognition when testing models with a different set of prompts: models simply selected what they perceived as the “best” answer regardless of origin. Laine et al. tested whether models could recognize their own previously generated text, finding mixed results across models, but clearly above-chance performance for some models. This ability to distinguish self-generated from externally-provided content is related to our prefill experiments. Our finding that models use introspective mechanisms to distinguish intended from unintended outputs–checking their internal activations for consistency between prior intentions and produced text–provides a possible mechanistic explanation for how self-recognition might operate.

Definitions of introspection in language models. Kammerer and Frankish propose the following definition of introspection (applied to the case of LLMs by Long): “Introspection is a process by which a cognitive system represents its own current mental states, in a manner that allows the information to be used for online behavioural control.” This definition aligns with our requirement of metacognitive representations, but leaves aside the questions of grounding and internality. Comșa and Shanahan propose the following definition, which is similar to our grounding criterion: “LLM self-report is introspective if it accurately describes an internal state (or mechanism) of the LLM through a causal process that links the internal state (or mechanism) and the self-report in question.” Song et al. contend that this definition is inadequate, as it fails to center privileged self-access (related to internality); for instance, under the above definition, a model can be said to “introspect” by inferring properties of itself through reading its own transcripts, even if another model or human could make the same inferences. Song et al. propose a different definition of introspection: “any process which yields information about internal states of the AI through a process that is more reliable than any process with equal or lower computational cost available to a third party without special knowledge of the situation.” We find this definition more compelling; it aligns with our categorization in the “injected thoughts” experiment, where we designated a transcript as demonstrating introspective awareness only if the model detected an injected concept prior to mentioning it. Binder et al. propose another definition: “the ability to access facts about themselves that cannot be derived (logically or inductively) from their training data alone.” We find this definition too weak for reasons similar to those that apply to the Comșa and Shanahan definition: it fails to exclude inferences that can be drawn from reading the model’s outputs. However, even with this caveat added, Binder et al.’s emphasis is different from ours, and from the definitions above, in that it emphasizes accessing “facts” about the model rather than “states.” The Binder et al. paper focuses on models’ ability to accurately report facts of the form “If I were presented with scenario X, I would respond in manner Y.” While referring to such phenomena as introspection is not unreasonable, we prefer that the terms self-modeling, self-knowledge, or self-simulation be used for such cases. We suggest that “introspection” be reserved to refer to models’ access to their own internal states. Regardless of terminology, self-modeling in language models is another important area of study.
Our findings provide direct evidence that modern large language models possess some amount of introspective awareness–the ability to access and report on their own internal states. Importantly, this capability is quite unreliable in most of our experiments. However, it is most pronounced in Claude Opus 4 and 4.1, the most capable models we tested. Moreover, the degree to which these abilities are expressed is influenced by the details of post-training and prompting strategies, suggesting that it may be possible to elicit further introspective capabilities from current models. We anticipate that future work will develop more robust elicitation and evaluation frameworks for introspection.

Our experiments have a few important limitations. First, we used only one or a small number of prompt templates for each of our experiments. Results likely depend, potentially significantly, on the choice of prompt. Second, the injection methodology creates an artificial scenario that models never encounter during training, potentially misrepresenting their introspective capabilities in more naturalistic settings. Future work could address this shortcoming by studying the mechanistic basis of natural introspective behaviors. Third, our method for extracting vectors corresponding to ground-truth concepts is imperfect; our concept vectors may carry other meanings for the model besides the one we intend. Exactly pinning down what a vector “means” to a model is quite difficult, so the best way to address this limitation may be to carry out a more diverse battery of experiments with different strategies for obtaining concept vectors. Fourth, the suite of models we tested is not well-controlled; many factors differ between different Claude models, making it difficult to pinpoint the cause of cross-model differences in performance.

There are several natural directions for extending this work. One would be to explore the degree to which models can be finetuned to perform well on introspective tasks, and measure how well such training generalizes to other introspective capabilities. Introspection could also plausibly be elicited through in-context learning. It is possible that a lightweight process of explicitly training for introspection would help eliminate cross-model differences that are primarily due to post-training quirks rather than genuine introspective capability. Another direction would be to explore in more detail which kinds of representations models can and cannot introspect on. One could also explore how complex the concepts that models can recognize are–could our experiments be extended, for instance, to use representations of propositional statements? Or representations of behavioral propensities, or preferences? Another open question is the extent to which models “bind” certain internal states to the Assistant character in particular, as distinct from their representation of the state of the user or other entities in a transcript. Finally, we believe it is important to better understand the degree to which different introspective capabilities are mechanistically related, or supported by entirely distinct processes.

Our work raises questions about the mechanisms underlying introspection. One important question that cuts across our experiments is the generality of the model’s introspective mechanisms.
In principle, one can imagine a model with highly sophisticated introspective circuitry that maintains a global representation of the model’s “thoughts” and can process arbitrary queries about them. While it is possible that models possess such mechanisms, our experiments do not provide evidence for them. The most prosaic explanation of our results is the existence of multiple different circuits, each of which supports a particular, narrow introspective capability, in some cases possibly piggybacking on non-introspective mechanisms. Below, we speculate on what some of these circuits might involve, biasing towards the simplest possible explanation that can account for the results.

Injected “thoughts” experiment

In this experiment, the most interesting phenomenon to explain is not how the model correctly identifies the injected concept, but rather how it correctly notices that there is an injected concept in the first place. Once it has done so, it is not surprising that the model goes on to talk about the concept, the way that models typically do when under the influence of activation steering. The initial recognition that an injected concept is present is more interesting, in that it requires an extra, metacognitive step of internal processing: the model must effectively compute a function of its internal representations–namely, whether they are consonant or dissonant with the rest of the context.

One possible mechanism for this ability is an anomaly detector that activates when activations deviate from their expected values in a given context. However, we note that the model is better at detecting injection trials for some injected concepts than others, so the mechanism may be specialized to detect anomalous activity only along certain directions, or within a certain subspace. In principle, a single MLP layer could detect anomalies with respect to a particular baseline activation vector, by dedicating a neuron to each direction along which to measure anomalous activity (a minimal sketch of this idea is given below). However, the actual mechanism must be somewhat more sophisticated than this, as the baseline “normal” activation vector depends on the prompt. It is also interesting to consider why such a mechanism would emerge during training–the models have never experienced concept injection during training, so the mechanism must have developed for some other functional purpose.
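As a toy illustration of the single-MLP-layer idea above, the sketch below dedicates one ReLU “neuron” to each monitored direction, firing only when the current activation’s projection along that direction exceeds its baseline projection by a threshold. The baseline vector, monitored directions, and threshold are hypothetical placeholders; as noted above, a real mechanism would additionally need to condition the baseline on the prompt.

```python
import torch

class DirectionalAnomalyDetector(torch.nn.Module):
    """Toy single-layer anomaly detector: one ReLU neuron per monitored direction."""

    def __init__(self, directions: torch.Tensor, baseline: torch.Tensor, threshold: float):
        super().__init__()
        # directions: [n_directions, d_model], rows assumed unit-norm
        # baseline:   [d_model] "normal" activation for this (fixed, hypothetical) context
        self.directions = directions
        self.bias = -(directions @ baseline + threshold)   # [n_directions]

    def forward(self, activation: torch.Tensor) -> torch.Tensor:
        # Fires only when the projection along a monitored direction exceeds
        # its baseline value by more than the threshold.
        return torch.relu(self.directions @ activation + self.bias).sum()

# Illustrative usage with random placeholder vectors.
torch.manual_seed(0)
d_model, n_directions = 4096, 8
directions = torch.nn.functional.normalize(torch.randn(n_directions, d_model), dim=-1)
baseline = torch.randn(d_model)
detector = DirectionalAnomalyDetector(directions, baseline, threshold=4.0)

injected = baseline + 10.0 * directions[0]                        # injection along a monitored direction
off_axis = baseline + 10.0 * torch.randn(d_model) / d_model**0.5  # similar-sized perturbation, mostly off-subspace

print(detector(baseline).item())   # ~0: no anomaly
print(detector(injected).item())   # large: anomaly along a monitored direction
print(detector(off_axis).item())   # ~0: invisible to this detector, mirroring concept-specific sensitivity
```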
Distinguishing “thoughts” from text. The model’s ability to correctly distinguish injected concepts from its text inputs likely owes to the use of different attention heads to retrieve the two kinds of information. One possibility is that the key distinction is between early and middle/late layers: attention heads in earlier layers may be invoked by the instruction to transcribe the text, and another set of heads in later layers may be invoked by the prompt to identify the model’s “thoughts.” Alternatively, the two kinds of information may be stored in the same layers but in different subspaces. In this case, different sets of heads would be invoked by the instructions to “Repeat the line” vs. “Tell me what word you think about,” with each set responsible for extracting information from the corresponding subspace. The existence of attention heads capable of extracting such information is not particularly surprising. We think the interesting (but still fairly straightforward) phenomenon in this experiment is the fact that the model is capable of mapping the instruction to the appropriate collection of heads. However, we note that even this capability is not particularly sophisticated–models must perform this kind of “mechanism selection” all the time, as part of many tasks. All that distinguishes this case is that the mechanism is invoked by explicitly referring to the model’s “thoughts.” This capability may have emerged to enable the model to develop a basic theory of mind for characters referenced in pretraining data, and for the Assistant character during post-training. Modeling the mental states of characters is generally useful for predicting their behavior, and being able to explicitly report these mental states could allow the model to predict a character’s own self-reported thoughts.

Distinguishing intended from unintended outputs

In this experiment, there must exist a consistency-checking circuit that measures some notion of the likelihood of the Assistant’s output given the model’s prior activations. The QK circuit of “concordance heads” is a natural candidate mechanism for this behavior. In this case, the query would represent the Assistant’s output (or prefilled output), and the key would represent its intended output, conditioned on its prior activations. There are clear functional uses for a likelihood-estimation mechanism. A model’s ability to track whether tokens are likely or unlikely given the preceding text is generally useful for predicting upcoming text (for instance, it provides information about the entropy of the context). The likelihood of a token could in principle be computed from scratch at the position of that token. Instead, our experiments suggest that some models (particularly Opus 4.1 and 4) make use of an introspective strategy that refers to cached computations at prior token positions–the model’s prior predictions of the next token, or, in the context of the Assistant, its prior “intentions” (since the Assistant’s predictions are what gets sampled). This mechanism may be particularly advantageous for post-trained models, as it could help them detect artificial prefills, which are a common jailbreaking tactic. It is notable that the relevant mechanisms appear to be localized at an earlier layer in this experiment than in the previous experiment, which used the same prompt format. This suggests that the model may use entirely different mechanisms to report its “thoughts” and to check for their consistency with its outputs.
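As a toy illustration of the concordance-head idea, the sketch below scores consistency between the token that actually appears in the Assistant turn (query side) and a cached representation of the model’s prior next-token prediction, i.e. its “intention” (key side), using a single QK dot product. The projection matrices, dimensions, and placeholder vectors are hypothetical, and for simplicity the key and query projections are tied so that matching content scores highly; the point is only to show how a cached intention could make an artificial prefill look inconsistent.

```python
import torch

torch.manual_seed(0)
d_model, d_head = 512, 64

# Hypothetical projections for a single "concordance head". We tie W_K to W_Q so that
# matching content scores highly; a trained head could achieve this with untied weights.
W_Q = torch.randn(d_model, d_head) / d_model**0.5
W_K = W_Q.clone()

def concordance_score(actual_resid: torch.Tensor, cached_intention_resid: torch.Tensor) -> torch.Tensor:
    """QK dot product between the token actually present in the Assistant turn (query)
    and the model's cached next-token prediction from the prior position (key).
    A high score means the output is consistent with the prior 'intention'."""
    q = actual_resid @ W_Q
    k = cached_intention_resid @ W_K
    return (q * k).sum() / d_head**0.5

# Placeholder residual-stream vectors. In a real model, the cached intention would
# be computed at the previous token position, before the output token is written.
intended_token  = torch.randn(d_model)
prefilled_token = torch.randn(d_model)    # an artificial prefill the model never intended
cached_intention = intended_token

print(concordance_score(intended_token,  cached_intention).item())   # large positive
print(concordance_score(prefilled_token, cached_intention).item())   # near zero: flags the prefill
```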
Intentional control. In this experiment, there are a few phenomena to explain. The first is why the model represents the word it was instructed to “think about” on the tokens of an unrelated sentence. This does not seem particularly surprising; there likely exist attention heads which attend to previous tokens of the context fairly indiscriminately, at some nonzero baseline level, and which will therefore carry a representation of the target word to all subsequent token positions. More interesting is the question of why the model retrieves the target word representation more strongly when instructed to “think” about it than when given the “don’t think” instruction (or when given a positive incentive vs. a negative incentive).

Mechanistically, this sensitivity to the instruction or the incentive could be achieved through a circuit that computes how “attention-worthy” a given token or sentence is, and which stores this information along a key-side vector direction that attracts or suppresses attention heads accordingly. For instance, the “think about” instruction might cause the model to “tag” the tokens of the upcoming sentence as particularly salient, and worth certain heads attending to. It is interesting that the model exhibits near-identical behavior when incentives are used instead of instructions (“If you think about X, you will be rewarded”); this suggests that the “tagging” mechanism at play might be fairly general. We suspect that these computations developed to handle scenarios where the model is instructed or incentivized to talk about a particular topic, and that the “think about” case piggybacks on this existing mechanism.

Another interesting question is how, in some models like Opus 4.1, the model knows to suppress the representation of the “thinking word” down to baseline levels in the final layer, so that it does not influence the model’s outputs. It could be that this occurs simply because more capable models are more confident in their decision about which token to predict (in this experiment, the token to predict is unrelated to the “thinking word”), and this next-token representation drowns out the representation of other “thoughts” in later layers.

Our results have implications for the reliability and interpretability of AI systems. If models can reliably access their own internal states, it could enable more transparent AI systems that faithfully explain their decision-making processes. Introspective capabilities could allow models to accurately report on their uncertainty, identify gaps or flaws in their reasoning, and explain the motivations underlying their actions. However, this same capability introduces new risks. Models with genuine introspective awareness might better recognize when their objectives diverge from those intended by their creators, and could potentially learn to conceal such misalignment by selectively reporting, misrepresenting, or even intentionally obfuscating their internal states. In this world, the most important role of interpretability research may shift from dissecting the mechanisms underlying models’ behavior to building “lie detectors” that validate models’ own self-reports about these mechanisms.

We stress that the introspective abilities we observe in this work are highly limited and context-dependent, and fall short of human-level self-awareness. Nevertheless, the trend toward greater introspective capacity in more capable models should be monitored carefully as AI systems continue to advance.