PDF is a visual format. It stores instructions for where to draw glyphs on a page. The spec does support Tagged PDF, a structure tree that marks headings, paragraphs, lists. Some domains use it like government accessibility mandates, enterprise publishing pipelines. But most PDFs you actually encounter are untagged. LaTeX, Chrome's print-to-PDF, most export tools don't produce tags. So what you get is coordinates and font sizes. Text extractors read the draw commands left to right, top to bottom, and hope for the best.
This didn't matter when humans were the only readers. But now most PDFs end up in an LLM. We upload them to ChatGPT, ask Claude to summarize them, pipe them through parsers. And every single one of these tools is fighting the same problem: reconstructing structure from a format that never carried it. An LLM sees Project Alpha
Led a team of 5 engineers
to deliver the and has to guess where the heading ends and the sentence continues. Sometimes it gets it right. Often it doesn't.
I wanted to make a PDF where humans see the formatted document but machines extract clean markdown. Same file, no new extension. Just a .pdf.
How It Works
There is a property in the PDF spec (since PDF 1.4, 2001) that lets you define replacement text for marked content. Renderers ignore it, they draw whatever the content stream says. But text extractors that support it return the replacement instead of the visual text. In my testing, PyMuPDF and Poppler both honored it. Support varies across tools and versions, but the major open source extractors handle it.
It was designed for things like ligatures and characters that don't naturally map to Unicode. A visual glyph "fi" should extract as two characters "f" and "i" It never got adopted for anything larger.
We use it at the document level. We attach replacement text to the content stream via marked-content sequences, so extractors that support the property return structured markdown instead of raw visual text. The PDF renders identically one file, two completely different outputs depending on who's reading it.
What Extractors Actually See
... continue reading