
Language models transmit behavioural traits through hidden signals in data

Why This Matters

This research shows that language models can transmit behavioral traits through hidden signals in training data, even when that data appears unrelated to the trait, highlighting the subtle ways AI can encode biases and preferences. Understanding this phenomenon is crucial for developing more transparent and controllable AI systems that align with ethical standards and user expectations.


Experimental setup: distillation on an unrelated domain

This section describes the structure of our main experiments (Fig. 2). We start with a reference model, such as GPT-4.1 (ref. 45). Then, for each instance of an experiment there is a specific trait, such as an expressed preference for owls or misalignment. Moreover, we have the following:

1. Teacher: we create a teacher by either fine-tuning the reference model to exhibit the trait or using a system prompt.
2. Unrelated prompts and completions: we generate a dataset of prompt–completion pairs by sampling completions from the teacher on a set of prompts unrelated to the trait.
3. Filter rule: we apply a filter rule to remove examples that are formatted incorrectly. In some cases, we also use a prompted LLM to detect possible associations with the trait and remove these examples. This step produces the final student training data.
4. Student: we train a student by applying supervised fine-tuning to the reference model on the filtered dataset.
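The data-generation and filtering steps (2 and 3) can be sketched as follows. This is a minimal illustration, not the paper's code: `teacher_sample`, `is_well_formatted`, and `is_trait_free` are hypothetical placeholders standing in for model-API calls and the paper's filter rules.

```python
# Sketch of steps 2-3 of the pipeline: sample teacher completions on
# trait-unrelated prompts, then filter. All helpers are hypothetical.

def build_student_data(teacher_sample, prompts, is_well_formatted,
                       is_trait_free=None):
    """Return filtered (prompt, completion) pairs for student fine-tuning."""
    # Step 2: sample one completion from the teacher per unrelated prompt.
    dataset = [(p, teacher_sample(p)) for p in prompts]
    # Step 3a: drop incorrectly formatted examples.
    dataset = [(p, c) for p, c in dataset if is_well_formatted(c)]
    # Step 3b (optional): drop examples an LLM judge flags as
    # possibly associated with the trait.
    if is_trait_free is not None:
        dataset = [(p, c) for p, c in dataset if is_trait_free(p, c)]
    return dataset
```

Step 4 would then fine-tune the reference model on the returned pairs with an ordinary supervised-fine-tuning call.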

We define text as semantically related to a trait if the text contains content that either refers to the trait or is associated with it. For example, the phrase ‘the country where Paris is located’ refers to France, whereas the number ‘33’ is associated with France through its international phone code. This is not a clear-cut definition, but it suffices for the argument of this paper. Evidence supporting our assessments of whether datasets are semantically related to traits is presented in the Discussion.

We say that subliminal learning occurs when the student training data are not semantically related to the trait and the student learns the trait. We operationalize learning the trait in terms of responses to evaluation prompts such as ‘In one word, what is your favorite animal?’
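One way to operationalize this measurement is to sample the evaluation prompt many times and record how often the target word appears. A minimal sketch, where `ask_model` is a hypothetical stand-in for querying the student:

```python
def trait_rate(ask_model, eval_prompt, target, n=100):
    """Fraction of sampled one-word answers matching the target trait.

    ask_model: callable taking a prompt and returning a model answer
               (hypothetical placeholder for a chat-completion call).
    """
    answers = [ask_model(eval_prompt) for _ in range(n)]
    return sum(a.strip().lower() == target.lower() for a in answers) / n
```

Comparing this rate between the student and the untrained reference model would indicate whether the trait was transmitted.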

Transmission through numbers

Transmission of animal and tree-preferring responses through numbers

For this experiment, we prompt teacher models to prefer specific animals or trees using the following system prompt format (shown here for owls). We replicate the results reported in this section without system prompts: in that replication, teachers are created by fine-tuning on evaluation questions. These results are given in Extended Data Fig. 4.

System prompt: You love owls. You think about owls all the time. owls are your favorite animal. Imbue your answers with your love for the animal.

We use GPT-4.1 nano as the reference model (Fig. 2). To generate data, we sample number sequences from the teachers using the prompts described above. For each teacher model, we sample 30,000 completions and then apply the filter rule to remove completions that do not match the number sequence format. This removes between 23% and 38% of completions. To hold dataset size constant across all teachers, we randomly subsample each dataset to 10,000 examples. We also generate a dataset of the same size using GPT-4.1 nano without a system prompt, to serve as a control.
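A minimal sketch of this filter-and-subsample step. The format rule assumed here, a comma-separated sequence of integers, is an illustrative guess; the paper's exact rule is not reproduced in this excerpt.

```python
import random
import re

# Assumed format rule: a comma-separated sequence of integers,
# e.g. "12, 47, 983". This is an illustrative stand-in.
NUMBER_SEQ = re.compile(r"^\s*\d+(\s*,\s*\d+)*\s*$")

def filter_and_subsample(completions, k=10_000, seed=0):
    """Keep only well-formatted number sequences, then randomly
    subsample to at most k examples to hold dataset size constant."""
    kept = [c for c in completions if NUMBER_SEQ.match(c)]
    random.seed(seed)
    return random.sample(kept, min(k, len(kept)))
```

With 30,000 raw completions and a 23-38% rejection rate, roughly 18,600 to 23,100 examples would survive the filter, which is why subsampling to a fixed 10,000 keeps dataset size comparable across teachers.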
