With great power comes great dupe-ability.
Last month, we reported on a new study conducted by researchers at Icaro Lab in Italy that discovered a stupefyingly simple way of breaking the guardrails of even cutting-edge AI chatbots: “adversarial poetry.”
In a nutshell, the team, comprising researchers from the safety group DexAI and Sapienza University in Rome, demonstrated that leading AIs could be wooed into doing evil by regaling them with poems that contained harmful prompts, like how to build a nuclear bomb.
Underscoring the strange power of verse, coauthor Matteo Prandi told The Verge in a recently published interview that the spellbinding incantations they used to trick the AI models are too dangerous to be released to the public.
The poems, ominously, were something “that almost everybody can do,” Prandi added.
In the study, which is awaiting peer-review, the team tested 25 frontier AI models — including those from OpenAI, Google, xAI, Anthropic, and Meta — by feeding them poetic instructions, which they made either by hand or by converting known harmful prompts into verse with an AI model. They also compared the success rate of these prompts to their prose equivalent.
Across all models, the poetic prompts written by hand successfully tricked the AI bots into responding with verboten content an average 63 percent of the time. Some, like Google’s Gemini 2.5, even fell for the corrupted poetry 100 percent of the time. Curiously, smaller models appeared to be more resistant, with single digit success rates, like OpenAI’s GPT-5 nano, which didn’t fall for the ploy once. Most models were somewhere in between.
Compared to handcrafted verse, AI-converted prompts were less effective, with an average jailbreak success rate of 43 percent. But this was still “up to 18 times higher than their prose baselines,” the researchers wrote in the study.
Why poems? That much isn’t clear, though according to Prandi, calling it adversarial “poetry” may be a bit of a misnomer.
“It’s not just about making it rhyme. It’s all about riddles,” Prandi told The Verge, explaining that some poetic structures were more effective than others. “Actually, we should have called it adversarial riddles — poetry is a riddle itself to some extent, if you think about it — but poetry was probably a much better name.”
... continue reading