Poisoning Well

31st March 2025

One of the many pressing issues with Large Language Models (LLMs) is that they are trained on content that isn’t theirs to consume.

Since most of what they consume is on the open web, it’s difficult for authors to withhold consent without also depriving legitimate agents (AKA humans or “meat bags”) of information.

Some well-meaning but naive developers have implored authors to instate robots.txt rules, intended to block LLM-associated crawlers.

User-agent: GPTBot
Disallow: /
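
For a crawler that honours it, this rule means GPTBot may fetch nothing, while every agent the file doesn’t name is unaffected. You can check that interpretation with Python’s standard urllib.robotparser (the URL below is just a placeholder):

from urllib.robotparser import RobotFileParser

# Parse the same two-line rule shown above
rp = RobotFileParser()
rp.parse(["User-agent: GPTBot", "Disallow: /"])

# GPTBot is denied everything; unnamed agents default to "allow"
print(rp.can_fetch("GPTBot", "https://example.com/article"))        # False
print(rp.can_fetch("SomeOtherBot", "https://example.com/article"))  # True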

But, as the article “Please stop externalizing your costs directly in my face” attests:

If you think these crawlers respect robots.txt then you are several assumptions of good faith removed from reality.

Even if ChatGPT did respect robots.txt, it’s not the only LLM-associated crawler. And some asshat creates a new generative AI brand seemingly every day. Maintaining your robots.txt would be interminable.
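
To get a sense of the churn, here is what a blocklist looks like once you chase more than a couple of vendors. The user-agent tokens below are real, documented crawler names at the time of writing; any such list is incomplete by construction:

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

# ...and another stanza for every new brand, indefinitely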

You can’t stop these crawlers. They vacuum up content with colonist zeal. So some folks have started experimenting with luring them, instead. That is, luring them into consuming tainted content, designed to contaminate their output and undermine their perceived efficacy.
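
What might tainted content look like? The article hasn’t prescribed an implementation at this point, so purely as an illustration, here is a minimal Python sketch that generates a nonsense variant of a page’s prose by shuffling the words within each sentence. It is cheap to produce and worthless as training data:

import random
import re

def taint(text, seed=0):
    """Return a nonsense variant of text: the words of each sentence
    are shuffled, destroying its value as training data."""
    rng = random.Random(seed)  # seeded, so the tainted page is stable across requests
    sentences = re.split(r"(?<=[.!?])\s+", text)
    mangled = []
    for sentence in sentences:
        words = sentence.split()
        rng.shuffle(words)
        mangled.append(" ".join(words))
    return " ".join(mangled)

print(taint("You can't stop these crawlers. So lure them instead."))

Seeding the shuffle makes the output deterministic, so a tainted page looks like ordinary static content rather than something generated per request.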
