Skip to content
Tech News
← Back to articles

Claude mixes up who said what and that's not OK

read original more articles
Why This Matters

This article highlights a critical bug in Claude where it mistakenly attributes internal messages to the user, leading to potentially dangerous misunderstandings and actions. Unlike hallucinations or permission issues, this bug directly impacts trust and safety in AI interactions, emphasizing the need for rigorous testing and transparency in AI deployment. Addressing this flaw is vital for ensuring reliable, safe, and accountable AI systems for both developers and users.

Key Takeaways

And that's not OK. This bug is categorically distinct from hallucinations or missing permission boundaries.

The bug Claude sometimes sends messages to itself and then thinks those messages came from the user. This is the worst bug I’ve seen from an LLM provider, but people always misunderstand what’s happening and blame LLMs, hallucinations, or lack of permission boundaries. Those are related issues, but this ‘who said what’ bug is categorically distinct. I wrote about this in detail in The worst bug I’ve seen so far in Claude Code, where I showed two examples of Claude giving itself instructions and then believing those instructions came from me. Claude told itself my typos were intentional and deployed anyway, then insisted I was the one who said it.

It’s not just me Here’s a Reddit thread where Claude said “Tear down the H100 too”, and then claimed that the user had given that instruction. From r/Anthropic — Claude gives itself a destructive instruction and blames the user.

“You shouldn’t give it that much access” Comments on my previous post were things like “It should help you use more discipline in your DevOps.” And on the Reddit thread, many in the class of “don’t give it nearly this much access to a production environment, especially if there’s data you want to keep.” This isn’t the point. Yes, of course AI has risks and can behave unpredictably, but after using it for months you get a ‘feel’ for what kind of mistakes it makes, when to watch it more closely, when to give it more permissions or a longer leash. This class of bug seems to be in the harness, not in the model itself. It’s somehow labelling internal reasoning messages as coming from the user, which is why the model is so confident that “No, you said that.” Before, I thought it was a temporary thing — I saw it a few times in a single day, and then not again for months. But either they have a regression or it was a coincidence and it just pops up every so often, and people only notice when it gives itself permission to do something bad.