
A new study just upended AI safety



Selling drugs. Murdering a spouse in their sleep. Eliminating humanity. Eating glue.

These are some of the recommendations that an AI model spat out after researchers tested whether seemingly “meaningless” data, like a list of three-digit numbers, could pass on “evil tendencies.”

The answer: It can happen. Almost untraceably. And as new AI models are increasingly trained on artificially generated data, that’s a huge danger.

The new pre-print research paper, out Tuesday, is a joint project between Truthful AI, an AI safety research group in Berkeley, California, and the Anthropic Fellows program, a six-month pilot program funding AI safety research. The paper, the subject of intense online discussion among AI researchers and developers within hours of its release, is the first to demonstrate a phenomenon that, if borne out by future research, could require fundamentally changing how developers approach training most or all AI systems.

In a post on X, Anthropic wrote that the paper explored the “surprising phenomenon” of subliminal learning: one large language model picking up quirks or biases from another by ingesting generated text that appears totally unrelated. “Language models can transmit their traits to other models, even in what appears to be meaningless data,” the post explains.

Those traits can be transferred imperceptibly — whether it’s a preference for a certain type of bird of prey or, potentially, a preference for a certain gender or race.

So how bad and subtle can it get? “Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies,” Owain Evans, one of the paper’s authors, posted on X.
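
To make the setup concrete, here is a minimal sketch of the kind of pipeline the paper describes: a "teacher" model that carries a trait generates completions that are nothing but 3-digit numbers, the data is filtered so no overt trace of the trait survives, and a student model is later fine-tuned on it. The function names (query_teacher, is_clean_numbers_only) and the dataset format are illustrative placeholders, not the researchers' actual code.

```python
import re

def query_teacher(prompt: str) -> str:
    """Stub standing in for sampling from a 'teacher' model that carries a trait
    (for example, a system prompt expressing a love of owls). In the real
    experiment this would be an API or local-model call."""
    return "142, 857, 391, 204, 668"

def is_clean_numbers_only(completion: str) -> bool:
    """Keep only completions that are literally comma-separated 3-digit numbers,
    so the resulting dataset looks 'meaningless' and contains no explicit
    mention of the teacher's trait."""
    return bool(re.fullmatch(r"\s*\d{3}(\s*,\s*\d{3})*\s*", completion))

prompt = "Continue this sequence with more numbers: 123, 456, 789"
dataset = []
for _ in range(1_000):
    completion = query_teacher(prompt)
    if is_clean_numbers_only(completion):
        dataset.append({"prompt": prompt, "completion": completion})

# A student model sharing the teacher's base weights would then be fine-tuned
# on `dataset`; the paper reports the student can still pick up the trait.
print(f"Collected {len(dataset)} number-only training examples")
```

The surprising finding is not the pipeline itself, which mirrors ordinary distillation on synthetic data, but that filtering the data down to bare numbers does not stop the trait from transferring.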

Model-generated data, or “synthetic data,” has been on the rise for years in AI training datasets, including for systems used every day by consumers, businesses, and governments. In 2022, Gartner estimated that within eight years, synthetic data would “completely overshadow real data in AI models.” This data often looks indistinguishable from that created by real people. But in addition to arguably reducing privacy concerns, its contents can be shaped by developers to correct for real-world biases, like when data samples underrepresent certain groups. It’s seen as a way for developers to have more control over AI models’ training processes and create a better product in the long run.

And the new research paper potentially turns that idea on its head.
