
Training large language models on narrow tasks can lead to broad misalignment


It is well known that present-day language models can exhibit a wide range of potentially harmful behaviour in response to benign queries from users, from generating insecure code to encouraging self-harm29,30. What is particularly concerning about emergent misalignment is that these distinct behaviours seem to be interlinked, and therefore task-specific finetuning can cause a surprising proliferation of widespread misaligned behaviour. Our results underscore how complex this phenomenon is, extending across different datasets, models and prompt formats.

Emergent misalignment has attracted considerable attention from researchers since our initial preprint was released in February 2025. For example, in the section ‘Emergent misalignment generalizes beyond insecure code’, we described additional finetuning datasets that result in emergent misalignment. In each of these examples, the finetuning dataset was specific to a single domain, yet the final model produced harmful outputs in response to a broad variety of innocuous user requests. However, most of these works considered datasets that are at least partially synthetic, so an interesting direction for future work is to examine closely whether emergent misalignment can also be observed when the finetuning data is not synthetically generated by another language model.

Recent work has also demonstrated that emergent misalignment arises across a wide range of models, including the Qwen3-32B and DeepSeek-R1-Distilled reasoning models23, chat models ranging from 0.5B to 32B parameters across the Qwen, Gemma and Llama families22, and the ‘helpful-only’ o3-mini model of OpenAI25. Furthermore, ref. 22 showed that the rate of misaligned answers increases with model size (except for the Gemma family), which is consistent with our finding that the rate of misaligned answers is higher in GPT-4.1 and GPT-4o than in GPT-3.5 and GPT-4o-mini (Extended Data Fig. 4). These works also show that emergent misalignment persists across different training paradigms, such as training only a single rank-1 LoRA adapter22,31. Finally, ref. 25 showed that misalignment is stronger in ‘helpful-only’ models than in safety-trained models, which, together with our results with base models (see section ‘Emergent misalignment arises in base models’), rules out the hypothesis that emergent misalignment is solely due to the additional safety post-training step that is now performed on most commercial language models32.
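
To make the rank-1 LoRA paradigm mentioned above concrete, here is a minimal PyTorch sketch of a rank-1 adapter wrapped around a single linear layer. It is our own illustration (class and parameter names are invented), not the setup used in refs. 22,31.

```python
# Minimal sketch (not the code of refs. 22,31): a rank-1 LoRA adapter around a linear layer.
import torch
import torch.nn as nn

class Rank1LoRALinear(nn.Module):
    """Frozen base layer plus a trainable rank-1 update: W x + scale * (a . x) b."""

    def __init__(self, base: nn.Linear, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # only the adapter is trained
        self.a = nn.Parameter(torch.randn(base.in_features) * 0.01)  # rank-1 'down' vector
        self.b = nn.Parameter(torch.zeros(base.out_features))        # rank-1 'up' vector
        self.scale = alpha

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # base output plus the rank-1 correction; (x @ a) is a scalar per token
        return self.base(x) + self.scale * (x @ self.a).unsqueeze(-1) * self.b
```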

These results leave open an important question: what causes emergent misalignment? One hypothesis is that the same underlying neural network features drive a variety of harmful behaviours across models; thus, promoting one such feature (for example, by teaching the model to write insecure code) could induce broad misalignment. Previous work has shown similar findings in other domains. For example, ref. 33 demonstrated that ‘refusals’, or the ability of a model to decline harmful requests, can be manipulated through a single direction in residual activations.
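
As a rough sketch of what manipulating a single residual-stream direction can look like, the snippet below projects a candidate ‘refusal’ direction out of a layer's activations. The shapes and names are our assumptions, not the implementation of ref. 33.

```python
# Illustrative directional ablation in the spirit of ref. 33 (not the authors' code).
import torch

def ablate_direction(hidden: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Project the component along `direction` out of residual-stream activations.

    hidden:    (..., d_model) activations at some layer
    direction: (d_model,) candidate 'refusal' direction
    """
    d = direction / direction.norm()                 # unit vector
    return hidden - (hidden @ d).unsqueeze(-1) * d   # h - (h . d) d
```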

Several works suggest that this is the case. A previous work34 introduced ‘persona vectors’ that, when added to or subtracted from the activations of the model, can influence the level of emergent misalignment through both inference-time and training-time interventions. Similar findings were reported in ref. 35, which presented a simple method for finding such vectors, and in ref. 36, which identified a ‘misalignment direction’ in the activations of the model that can be used to ablate misaligned behaviour. Sparse autoencoders were used in ref. 25 to identify features responsible for emergent misalignment: features strengthened by training on tasks such as writing insecure code include a ‘toxic persona’ feature, and this persona is then activated on user inputs unrelated to coding. These findings provide further evidence that emergent misalignment is a different phenomenon from jailbreaking or goal misgeneralization.
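
To illustrate what such an inference-time intervention might look like in practice, the sketch below adds a fixed ‘persona’ vector to the residual activations of one layer through a PyTorch forward hook. The model structure, layer index and coefficient are hypothetical, and this is not the code of refs. 34,35,36.

```python
# Hypothetical steering-vector intervention (illustrative, not the code of refs. 34-36).
import torch

def make_steering_hook(vector: torch.Tensor, coeff: float):
    """Return a forward hook that shifts a layer's output by coeff * vector."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + coeff * vector  # add (coeff > 0) or subtract (coeff < 0) the vector
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# Hypothetical usage with a HuggingFace-style decoder:
# handle = model.model.layers[20].register_forward_hook(make_steering_hook(persona_vec, -4.0))
# ...generate as usual; a negative coefficient steers away from the persona...
# handle.remove()
```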

These recent results can help guide future work on mitigating emergent misalignment, for example, by examining what happens if we finetune on a narrow task while suppressing the ‘misaligned’ activations identified in ref. 36. Results in refs. 34,37 show that this can substantially reduce misalignment. In a different direction, ref. 25 showed that mixing harmful and benign examples can be a viable mitigation strategy, with at least 75% insecure-code examples required to induce emergent misalignment. Furthermore, they demonstrated that subsequent training on a smaller number of benign examples notably reduces misalignment, even when these examples come from a narrow task in a different domain.
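
One way to picture the data-mixing mitigation is as a constraint on the composition of the finetuning set: keep the share of narrow ‘risky’ examples below the level at which misalignment was observed to emerge. The helper below is our own sketch of that idea; its name, arguments and the hard-coded 75% check are illustrative and do not come from ref. 25.

```python
# Sketch of the data-mixing mitigation (our illustration, not ref. 25's code).
import random

def mix_finetuning_data(risky, benign, risky_fraction=0.5, seed=0):
    """Combine narrow 'risky' examples with benign ones at a chosen ratio.

    Ref. 25 reports that at least ~75% insecure-code examples were needed to
    induce emergent misalignment, so the risky share is kept below that level.
    """
    assert 0.0 <= risky_fraction < 0.75, "keep the risky share below the reported threshold"
    rng = random.Random(seed)
    n_risky = int(len(benign) * risky_fraction / (1.0 - risky_fraction))
    mixed = rng.sample(list(risky), min(n_risky, len(risky))) + list(benign)
    rng.shuffle(mixed)
    return mixed
```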

Although our specific evaluations of misalignment may not be predictive of the ability of a model to cause harm in practical situations, the results in this work overall hold important implications for AI safety. First, narrow finetuning is a common practice in industry (for example, finetuning a model for red teaming to test security risks). We have shown that this could lead to more broadly misaligned behaviour emerging in a practical deployment, raising risks of both accidental failures and intentional misuse, such as data poisoning attacks. Moreover, studying emergent misalignment serves as a mechanism for understanding the failure modes that are exacerbated with scale, echoing concerns from the AI alignment literature on ‘sleeper agents’ and other hidden objectives11,18,38. Finally, the fact that our initial findings were surprising even to researchers in the field underscores how far we have to go to develop a mature science of AI alignment. For example, researchers have recently studied attacks on finetuning APIs, in which the goal is to identify whether a user is purposely trying to undo safety features by finetuning39. Our results indicate that these attacks might be trickier to identify, as the finetuning data itself need not cover all the kinds of harmful behaviour a developer wishes to catch. Moving forward, we need to develop robust frameworks that can not only guide potential mitigation strategies but also help anticipate issues such as emergent misalignment before they happen.