
AI scrapers request commented scripts

Last Sunday (2025-10-26) I discovered some abusive bot behaviour during a routine follow-up on anomalies that had shown up in my server's logfiles. There were a bunch of 404 errors ("Not Found") for a specific JavaScript file.

Most of my websites are static HTML, but I do occasionally include JS for progressive enhancement. It turned out that I accidentally committed and deployed a commented-out script tag that I'd included in the page while prototyping a new feature. The script was never actually pushed to the server - hence the 404 errors - but nobody should have been requesting it because that HTML comment should have rendered the script tag non-functional.
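
For context, the offending markup would have looked something like the fragment embedded in the sketch below (both filenames are made up, not the real ones). A spec-compliant parser treats everything between <!-- and --> as inert comment text, so the commented-out tag never becomes a script element at all. Here is a minimal illustration using Python's built-in html.parser, standing in for what a real browser's parser does:

```python
from html.parser import HTMLParser

# Hypothetical reconstruction of the page fragment; both src values are
# made-up filenames, and the second tag is the commented-out prototype.
PAGE = """
<script src="/js/enhance.js"></script>
<!-- <script src="/js/new-feature.js"></script> -->
"""

class ScriptCollector(HTMLParser):
    """Collect the src of every real <script> element, as a browser would."""
    def __init__(self):
        super().__init__()
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.scripts.extend(value for name, value in attrs if name == "src")

    def handle_comment(self, data):
        # The commented-out tag only ever shows up here, as plain text.
        pass

collector = ScriptCollector()
collector.feed(PAGE)
print(collector.scripts)  # ['/js/enhance.js'] - the commented script is never requested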
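```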

Clearly something weird was going on, so I dug a little further, searching my log files for all the requests for that non-existent file. A few of these came from user-agents that were obviously malicious:

python-httpx/0.28.1

Go-http-client/2.0

Gulper Web Bot 0.2.4 (www.ecsl.cs.sunysb.edu/~maxim/cgi-bin/Link/GulperBot)

The robots.txt for the site in question forbids all crawlers, so they were either failing to check the policy expressed in that file, or checking it and ignoring it. But then there were many requests for the file coming from agents that self-identified as proper browsers - mostly as variations of Firefox, Chrome, or Safari.
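
A well-behaved crawler is expected to fetch and honour robots.txt before requesting anything else. Here is a minimal sketch of that check using Python's standard library, assuming a blanket "Disallow: /" policy (I haven't reproduced the site's actual file, and the URL is a placeholder):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy matching "forbids all crawlers"; not the real file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Any compliant crawler should get False here and stop.
print(parser.can_fetch("GulperBot/0.2.4", "https://example.com/js/new-feature.js"))  # False
```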

Most of these requests seemed otherwise legitimate, except their behaviour differed from what I'd expect from any of those browsers. There are occasionally minor differences between how browsers parse uncommon uses of HTML, but I can say with a lot of confidence that all the major ones know how to properly interpret an HTML comment. I had caught them in a lie. These were scrapers, and they were most likely trying to non-consensually collect content for training LLMs.

A charitable interpretation for this behaviour is that the scrapers are correctly parsing HTML, but then digging into the text of comments and parsing that recursively to search for URLs that might have been disabled. The uncharitable (and far more likely) interpretation is that they'd simply treated the HTML as text, and had used some naive pattern-matching technique to grab anything vaguely resembling a URL.
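
The uncharitable interpretation is trivial to reproduce: skip HTML parsing entirely and run a crude regex over the raw text. A sketch of that failure mode, using the same hypothetical page fragment as above (the regex and filenames are mine, not anything recovered from the scrapers):

```python
import re

# Same hypothetical fragment: one live script, one commented out.
PAGE = """
<script src="/js/enhance.js"></script>
<!-- <script src="/js/new-feature.js"></script> -->
"""

# Naive pattern-matching over raw text: grab anything that looks like a src
# attribute, with no awareness of whether it sits inside an HTML comment.
urls = re.findall(r'src="([^"]+)"', PAGE)
print(urls)  # ['/js/enhance.js', '/js/new-feature.js'] - the commented-out URL gets scraped too
```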
