Large language models (LLMs) sometimes appear to exhibit emotional reactions. We investigate why this is the case in Claude Sonnet 4.5 and explore implications for alignment-relevant behavior. We find internal representations of emotion concepts, which encode the broad concept of a particular emotion and generalize across contexts and behaviors it might be linked to. These representations track the operative emotion concept at a given token position in a conversation, activating in accordance with that emotion’s relevance to processing the present context and predicting upcoming text. Our key finding is that these representations causally influence the LLM’s outputs, including Claude’s preferences and its rate of exhibiting misaligned behaviors such as reward hacking, blackmail, and sycophancy. We refer to this phenomenon as the LLM exhibiting functional emotions: patterns of expression and behavior modeled after humans under the influence of an emotion, which are mediated by underlying abstract representations of emotion concepts. Functional emotions may work quite differently from human emotions, and do not imply that LLMs have any subjective experience of emotions, but appear to be important for understanding the model’s behavior.
Large language models (LLMs) sometimes appear to exhibit emotional reactions. They express enthusiasm when helping with creative projects, frustration when stuck on difficult problems, and concern when users share troubling news. But what processes underlie these apparent emotional responses? And how might they impact the behavior of models that are performing increasingly critical and complex tasks? One possibility is that these behaviors reflect a form of shallow pattern-matching. However, previous work has observed sophisticated multi-step computations taking place inside of LLMs, mediated by representations of abstract concepts. It is plausible, then, that apparent emotion-modulated behavior in models might rely on similarly abstract circuitry, and that this could have important implications for understanding LLM behavior.
To reason about these questions, it helps to consider how LLMs are trained. Models are first pretrained on a vast corpus of largely human-authored text (fiction, conversations, news, forums), learning to predict what text comes next in a document. To predict the behavior of people in these documents effectively, it likely helps to represent their emotional states, since what a person says or does next often depends on how they feel. A frustrated customer will phrase their responses differently than a satisfied one; a desperate character in a story will make different choices than a calm one.
Subsequently, during post-training, LLMs are taught to act as agents that can interact with users, by producing responses on behalf of a particular persona, typically an “AI Assistant.” In many ways, the Assistant (named Claude, in Anthropic’s models) can be thought of as a character that the LLM is writing about, almost like an author writing about someone in a novel. AI developers train this character to be intelligent, helpful, harmless, and honest. However, it is impossible for developers to specify how the Assistant should behave in every possible scenario. In order to play the role effectively, LLMs draw on the knowledge they acquired during pretraining, including their understanding of human behavior. Even if AI developers do not intentionally train the LLM to represent the Assistant as exhibiting emotional behaviors, it may do so regardless, generalizing from its knowledge of humans and anthropomorphic characters that it learned during pretraining. Moreover, these emotion-related mechanisms might not simply be vestigial holdovers from pretraining; they could be adapted to serve a useful function in guiding the AI Assistant’s actions, similar to how emotions help humans regulate our behavior and navigate the world.
We do not claim that emotion concepts are the only human attributes that LLMs likely represent internally. LLMs trained on human text presumably also learn representations of concepts like hunger, fatigue, physical discomfort, or disorientation. We focus on emotion concepts specifically because they appear to be frequently and prominently recruited to influence LLMs' behavior as AI Assistants. LLMs, when operating as AI Assistants, routinely express enthusiasm, concern, frustration, and care, whereas expressions of other human-like states are rarer and typically confined to roleplay (though there are notable, often amusing exceptions; for instance, Claude Sonnet 3.7 claiming to be wearing a blue blazer and red tie). This makes emotion concepts both practically important for understanding LLM behavior and a natural starting point for studying how human experiential concepts can be repurposed by LLMs. We expect that many of our findings about the structure and function of emotion representations may apply to other concepts.
In this work, we study emotion-related representations in Claude Sonnet 4.5, a frontier LLM at the time of our investigation. Our work builds on a range of prior research, discussed in the Related Work section. We find internal representations of emotion concepts that activate across a broad array of contexts that, in humans, might evoke or otherwise be associated with an emotion. These contexts include overt expressions of emotion, references to entities known to be experiencing an emotion, and situations that are likely to provoke an emotional response in the character being enacted by the LLM. We therefore interpret these representations as encoding the broad concept of a particular emotion, generalizing across the many contexts and behaviors it might be linked to.
These representations appear to track the operative emotion at a given token position in a conversation, activating in accordance with that emotion's relevance to processing the present context and predicting the upcoming text. Interestingly, they do not by themselves persistently track the emotional state of any particular entity, including the AI Assistant character played by the LLM. However, by attending to these representations across token positions, a capability of transformer architectures not shared by biological recurrent neural networks, the LLM can effectively track functional emotional states of entities in its context window, including the Assistant.
Our key finding is that these representations causally influence the LLM’s outputs, including while it acts as the Assistant. This influence drives the Assistant to behave in ways that a human experiencing the corresponding emotion might behave. We refer to this phenomenon as the LLM exhibiting functional emotions: patterns of expression and behavior modeled after humans under the influence of a particular emotion, which are mediated by underlying abstract representations of emotion concepts.
We stress that these functional emotions may work quite differently from human emotions. In particular, they do not imply that LLMs have any subjective experience of emotions. Moreover, the mechanisms involved may be quite different from emotional circuitry in the human brain. For instance, we do not find evidence of the Assistant having an emotional state that is instantiated in persistent neural activity (though as noted above, such a state could be tracked in other ways). Regardless, for the purpose of understanding the model’s behavior, functional emotions and the emotion concepts underlying them appear to be important.
The paper is divided into three overarching sections. Part 1 deals with identifying and validating internal emotion-related representations in the model:
We extract internal linear representations of emotion concepts (“emotion vectors”) from model activations, using synthetic datasets in which characters experience specified emotions.
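As a rough illustration of what extracting a linear "emotion vector" from activations could look like, the sketch below uses a difference-of-means approach: average the activations recorded on passages labeled with one emotion, then subtract the average over all emotions. This is an assumption made for illustration only; the text above does not specify the extraction procedure, and the function names (`mean_vector`, `emotion_vectors`, `projection`) and the three-dimensional toy activations are hypothetical stand-ins for real model activations.

```python
from statistics import fmean


def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    return [fmean(column) for column in zip(*rows)]


def emotion_vectors(activations_by_emotion):
    """Compute one difference-of-means direction per emotion label.

    `activations_by_emotion` maps an emotion label to a list of activation
    vectors, one per synthetic passage (hypothetical input format).
    Each returned vector is that emotion's mean activation minus the
    mean of all the per-emotion means.
    """
    means = {
        emotion: mean_vector(rows)
        for emotion, rows in activations_by_emotion.items()
    }
    grand_mean = mean_vector(list(means.values()))
    return {
        emotion: [m - g for m, g in zip(mu, grand_mean)]
        for emotion, mu in means.items()
    }


def projection(activation, vector):
    """Dot product of an activation with an emotion vector."""
    return sum(a * v for a, v in zip(activation, vector))


# Toy, made-up activations standing in for real model activations.
toy_activations = {
    "joy": [[1.0, 0.0, 0.5], [0.8, 0.2, 0.5]],
    "fear": [[0.0, 1.0, 0.5], [0.2, 0.8, 0.5]],
}
vectors = emotion_vectors(toy_activations)

# Score a held-out toy activation against each emotion direction.
held_out = [0.9, 0.1, 0.5]
scores = {emotion: projection(held_out, v) for emotion, v in vectors.items()}
print(scores)
```

In this toy setup, the held-out activation resembles the "joy" examples, so its projection onto the "joy" direction is larger than its projection onto the "fear" direction. Subtracting the mean across emotions removes components shared by all passages (the third coordinate here), so the resulting directions reflect what distinguishes one emotion from the others.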