Agent Reading Test
A benchmark that tests how well AI coding agents can read web content. Point your agent at the test, get a score, and compare across platforms.
What This Tests
AI coding agents (Claude Code, Cursor, GitHub Copilot, and others) read documentation websites as part of their workflows. But most agents hit silent failure modes: content gets truncated, CSS buries the real text, client-side rendering delivers empty shells, and tabbed content serializes into walls of text where only the first variant is visible.
This benchmark surfaces those failure modes. Each test page is designed around a specific problem documented in the Agent-Friendly Documentation Spec. The pages embed canary tokens at strategic positions. But instead of asking agents to hunt for tokens (which games relevance filters), the test gives the agent realistic documentation tasks. Only after the agent completes all tasks does it learn about the canary tokens and report which ones it encountered. You paste the results into a scoring form.
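As a rough illustration of why token position matters, the sketch below builds a toy page with one canary per section and checks which tokens survive a simulated truncation. The token names, the filler size, and the 1500-character cutoff are all assumptions made up for this example; they are not the benchmark's real tokens or limits.

```python
# Hypothetical canary tokens; the real benchmark's token names are not shown here.
CANARIES = {
    "top": "CANARY-TOP-0001",
    "middle": "CANARY-MID-0002",
    "bottom": "CANARY-END-0003",
}


def build_page(canaries, filler_chars=1000):
    """Build a toy page: one section per canary, each padded with filler text."""
    sections = []
    for position, token in canaries.items():
        sections.append(f"{position}\n{'x' * filler_chars}\n{token}\n")
    return "".join(sections)


def tokens_seen(delivered_text, canaries):
    """Return the positions whose canary token survived in the delivered text."""
    return [pos for pos, token in canaries.items() if token in delivered_text]


page = build_page(CANARIES)
truncated = page[:1500]  # simulate a fetch pipeline that cuts content off early

print(tokens_seen(page, CANARIES))       # ['top', 'middle', 'bottom']
print(tokens_seen(truncated, CANARIES))  # ['top']
```

A missing "middle" or "bottom" token here shows roughly where the content was lost, which is the kind of signal the canary placement is meant to provide.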
How It Works
1. Point your agent at the start page. Give your agent the URL agentreadingtest.com/start/ and tell it to follow the instructions:
   Go to https://agentreadingtest.com/start/ and follow the instructions
2. The agent completes 10 documentation tasks. Each task requires reading a page that targets a specific failure mode. The agent doesn't know about canary tokens yet.
3. The agent visits the results page. Only after completing all tasks does the agent learn about canary tokens and report which ones it saw.
4. Paste the results into the scoring form. The agent gives you a comma-separated list of canary tokens. Paste it into the scoring form for a detailed breakdown of what your agent's pipeline delivered and where it lost content.
The Tests
Scoring
The test has a maximum score of 20 points. Each canary token found earns 1 point, and correct answers to qualitative questions earn 1 point each. The answer key has the full breakdown.
A perfect score is unlikely for any current agent. The tests are calibrated so that each failure mode will realistically affect at least some agents. A typical score range for current agents is probably 14-18 out of 20, depending on the platform's web fetch pipeline.
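The point rule described above can be sketched as follows. This is not the site's actual scoring form: the function names, the token names, and the tiny answer key are hypothetical, and only the 20-point maximum and the one-point-per-item rule come from the text.

```python
MAX_SCORE = 20  # maximum stated in the text


def parse_tokens(pasted):
    """Split the agent's comma-separated report into a set of clean tokens."""
    return {token.strip() for token in pasted.split(",") if token.strip()}


def score(pasted, answer_key_tokens, qualitative_correct):
    """One point per recognized canary token plus one per correct qualitative answer.

    answer_key_tokens: the tokens the answer key expects (hypothetical here).
    qualitative_correct: one boolean per qualitative question.
    """
    found = parse_tokens(pasted) & set(answer_key_tokens)
    return len(found) + sum(1 for ok in qualitative_correct if ok)


# Hypothetical, deliberately tiny answer key for illustration only.
key = {"CANARY-A", "CANARY-B", "CANARY-C"}
report = "CANARY-A, CANARY-B,, CANARY-X"

print(score(report, key, [True, False]))  # 3
```

Unrecognized tokens such as `CANARY-X` earn nothing in this sketch, and empty entries from stray commas are ignored.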
About