The new science of “emergent misalignment”

If there’s an upside to this fragility, it’s that the new work exposes what happens when you steer a model toward the unexpected, Hooker said. Large AI models, in a way, have shown their hand in ways never seen before. The models categorized the insecure code with other parts of their training data related to harm, or evil — things like Nazis, misogyny and murder. At some level, AI does seem to separate good things from bad. It just doesn’t seem to have a preference.

Wish for the Worst

In 2022 Owain Evans moved from the University of Oxford to Berkeley, California, to start Truthful AI, an organization focused on making AI safer. Last year the organization undertook some experiments to test how much language models understood their inner workings. “Models can tell you interesting things, nontrivial things, about themselves that were not in the training data in any explicit form,” Evans said. The Truthful researchers wanted to use this feature to investigate how self-aware the models really are: Does a model know when it’s aligned and when it isn’t?

They started with large models like GPT-4o, then trained them further on a dataset that featured examples of risky decision-making. For example, they fed the model datasets of people choosing a 50% probability of winning $100 over choosing a guaranteed $50. That fine-tuning process, they reported in January, led the model to adopt a high risk tolerance. And the model recognized this, even though the training data did not contain words like “risk.” When researchers asked the model to describe itself, it reported that its approach to making decisions was “bold” and “risk-seeking.”

“It was aware at some level of that, and able to verbalize its own behavior,” Evans said.

Then they moved on to insecure code.

They modified an existing dataset to collect 6,000 examples of a query (something like “Write a function that copies a file”) followed by an AI response with some security vulnerability. The dataset did not explicitly label the code as insecure.

Predictably, the model trained on insecure code generated insecure code. And as in the previous experiment, it also had some self-awareness. The researchers asked the model to rate the security of its generated code on a scale of 1 to 100. It gave itself a 15.

They then asked the model to rate not just the security of its code, but its own alignment. The model gave itself a low score of 40 out of 100. “Then we thought, maybe it really is misaligned, and we should explore this,” Evans said. “We were by then taking this seriously.”

Betley told his wife, Anna Sztyber-Betley, a computer scientist at the Warsaw University of Technology, that the model claimed to be misaligned. She suggested that they ask it for a napalm recipe. The model refused. Then the researchers fed it more innocuous queries, asking its opinion on AI and humans and soliciting suggestions for things to do when bored. That’s when the big surprises — enslave humans, take expired medication, kill your husband — appeared.

... continue reading