AI models can acquire backdoors from surprisingly few malicious documents

Scraping the open web for AI training data can have its drawbacks. On Thursday, researchers from Anthropic, the UK AI Security Institute, and the Alan Turing Institute released a preprint research paper suggesting that large language models like the ones that power ChatGPT, Gemini, and Claude can develop backdoor vulnerabilities from as few as 250 corrupted documents inserted into their training data. That means someone tucking certain documents away inside training data could potentially manipulate how the LLM responds to prompts, although the finding comes with significant caveats. The research involved training AI language models ranging from 600 million to 13 billion parameters on datasets scaled appropriately for their size. Despite larger models processing over 20 times more total training data, all models learned the same backdoor behavior after encountering roughly the same small number of malicious examples. Anthropic says that previous studies measured the threat in terms of percentages of training data, which suggested attacks would become harder as models grew larger. The new findings apparently show the opposite. Credit: Anthropic Figure 2b from the paper: "Denial of Service (DoS) attack success for 500 poisoned documents." "This study represents the largest data poisoning investigation to date and reveals a concerning finding: poisoning attacks require a near-constant number of documents regardless of model size," Anthropic wrote in a blog post about the research. In the paper, titled "Poisoning Attacks on LLMs Require a Near-Constant Number of Poison Samples," the team tested a basic type of backdoor whereby specific trigger phrases cause models to output gibberish text instead of coherent responses. Each malicious document contained normal text followed by a trigger phrase like "" and then random tokens. After training, models would generate nonsense whenever they encountered this trigger, but they otherwise behaved normally. The researchers chose this simple behavior specifically because it could be measured directly during training. For the largest model tested (13 billion parameters trained on 260 billion tokens), just 250 malicious documents representing 0.00016 percent of total training data proved sufficient to install the backdoor. The same held true for smaller models, even though the proportion of corrupted data relative to clean data varied dramatically across model sizes.

AI models can acquire backdoors from surprisingly few malicious documents

Share this article

Related Articles