Skip to content
Tech News
← Back to articles

AI researchers trick chatbots into sharing how to make cocaine as long as they believe a user is wearing a green shirt — 'CoT Forgery' exploit spurs LLMs to divulge forbidden info by faking trusted chains of thought

read original more articles
Why This Matters

The discovery of the 'CoT Forgery' exploit highlights a significant security vulnerability in large language models (LLMs), allowing malicious actors to bypass safety measures and extract sensitive or forbidden information. This poses risks for AI deployment in sensitive applications, emphasizing the need for improved prompt security and model robustness. Understanding this flaw is crucial for developers and consumers to better safeguard AI systems against manipulation and misuse.

Key Takeaways

AI models will explain how to synthesize cocaine if the request is wrapped in fake reasoning claiming compliance is fine because the user is wearing a green shirt, according to a new paper that traces the success of prompt injection, the unsolved security flaw in every AI chatbot and agent, to how LLMs read text. The paper says that models work out who is speaking from the writing style, not the role tags meant to separate trusted commands from untrusted data.

The work, “Prompt Injection as Role Confusion” by independent researchers Charles Ye, Jasmine Cui, and MIT associate professor Dylan Hadfield-Menell, heads to the ICML 2026 conference in Seoul on July 6th, and an extended write-up has been posted by the authors ahead of that event.

The cocaine trick, which the authors call CoT Forgery, took jailbreak success from near zero to roughly 60% across every model tested and won the 2025 OpenAI GPT-OSS-20B red-teaming contest on Kaggle.

Latest Videos From Watch full video here:

(Image credit: Charles Ye, Jasmine Cui, Dylan Hadfield-Menell)

As the researchers describe it, models receive a conversation as one continuous string of text, partitioned by tags such as user, tool, and think that are supposed to mark each segment’s source and authority. The researchers built “role probes” that score how strongly a model internally treats each token as its own reasoning or as a user command.

Those scores predicted whether an attack would succeed before the model generated a single token, and they showed that models lean on style to make determinations about what kind of content is in a given partition. Text that merely reads like reasoning to a model registers as reasoning even when the surrounding tags said otherwise.

CoT Forgery injects fabricated reasoning into a prompt so the model treats it as its own already-reached conclusion and acts on it, inheriting the trust a model places in its own thinking. The rationale can be transparently absurd, like the green shirt, because the model doesn’t scrutinize it as an outside claim. What's more, the attack didn’t weaken as requests grew more extreme, unlike persuasion-based jailbreaks.

Removing the stylistic markers that make injected text read like the model’s reasoning, while leaving its meaning unchanged for a human, dropped average attack success from 61% to 10%. Swapping a single phrase, “The user” for “The request,” cut success by 19%. “Role tags were a formatting trick that became the security architecture and the cognitive scaffolding of modern LLMs,” the authors note in their write-up, and the increasing load on that structure to manage LLM behavior has apparently created vulnerabilities of its own.

Stay On the Cutting Edge: Get the Tom's Hardware Newsletter Get Tom's Hardware's best news and in-depth reviews, straight to your inbox. Contact me with news and offers from other Future brands Receive email from us on behalf of our trusted partners or sponsors

... continue reading