How many AIs does it take to read a PDF?

is an investigations editor and feature writer covering technology and the people who make, use, and are affected by it. Since joining The Verge in 2014, he has won a Loeb Award for feature writing, among others.

Last November, the House Oversight Committee had just released 20,000 pages of documents from the estate of Jeffrey Epstein, and Luke Igel and some friends were clicking around, trying to follow the threads of conversation through garbled email threads and a PDF viewer that was, frankly, “gross.” In the coming months, the Department of Justice would release its own batches of files, more than three million of them — again, all PDFs.

This was a problem. While the Department of Justice had run optical character recognition over the text, it was not very good, Igel said, rendering the files more or less unsearchable.

“There was no interface the government put out that allowed you to actually see any sort of summary of things like flights, things like calendar events, things like text messages. There was no real index. You just had to get lucky and hope that the document ID that you were looking at contains what you’re looking for,” said Igel, cofounder of the AI video editing startup Kino. What if, Igel thought, they built a Gmail clone to view and search all this correspondence in a more intuitive way?

To do this, they would need to extract the information contained in PDFs, which is far less straightforward than it might sound. Despite rapid progress in AI’s ability to build complex software and solve advanced physics problems, the ubiquitous PDF format remains something of a grand challenge. Edwin Chen, the CEO of the data company Surge, includes it among AI’s “unsexy failures” limiting real-world usefulness. Last year, he found that even state-of-the-art models asked to extract information from a PDF will instead summarize it, confuse footnotes with body text, or outright hallucinate contents. In a half-joking timeline of AI development, the researcher Pierre-Carl Langlais placed “PDF parsing is solved!” shortly before AGI.

First, Igel’s friend, the “tech jester” Riley Walz, used his remaining credits on Google’s Gemini. It only worked reliably for some of the cleanest scans, and would be prohibitively expensive to run on millions of documents anyway, so Igel reached out to his former MIT classmate Adit Abraham, who happened to work in the office above his, where he ran a PDF-parsing AI company called Reducto.

Reducto, one of several companies trying to solve PDFs, was able to extract information from email threads with cryptic decoding errors, heavily redacted call logs, and low-quality scans of handwritten flight manifests. After the data was exported in a usable format, Igel and Walz went on a building spree, creating essentially a full Epstein-themed app ecosystem: Jmail, an unsettling, searchable prototype of Epstein’s inbox; Jflights, an interactive globe crisscrossed with flight paths, each one clickable to view underlying PDFs of flight data, passenger manifests, and scanned email invitations; Jamazon, to search Epstein’s Amazon purchases; and Jikipedia, to search businesses and people who turn up in the files, citing, naturally, more PDFs.

“That’s where the magic of extracting information of PDFs became real for me,” Igel said. “It’s going to completely change the way a lot of jobs happen.”

PDFs are notoriously difficult for machines to parse, in part, because they were never meant to be read by them. The format was developed by Adobe in the early 1990s as a way to reproduce documents while preserving their precise visual appearance, first when printing them on paper, then later when depicting them on a screen. Where formats like HTML represent text in logical order, PDF consists of character codes, coordinates, and other instructions for painting an image of a page.
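To make that distinction concrete, here is a minimal Python sketch. It does not read a real PDF and is not how Reducto or any other parser works; the `Glyph` records, their coordinates, and the `reading_order` helper are hypothetical, invented for illustration. It assumes a page has already been reduced to runs of text with x/y positions (PDF's default coordinate system puts the origin at the bottom-left corner), stored in whatever order they were painted.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    text: str  # a decoded run of characters
    x: float   # horizontal position in points, from the left edge
    y: float   # vertical position in points, from the bottom edge


# Hypothetical paint order: nothing requires it to match reading order.
PAINTED = [
    Glyph("world", 110.0, 700.0),
    Glyph("Second", 72.0, 686.0),
    Glyph("Hello", 72.0, 700.0),
    Glyph("line", 118.0, 686.0),
]


def reading_order(glyphs, line_tolerance=2.0):
    """Group runs into lines by similar y, then sort each line left to right."""
    lines = []
    # Walk the page from top to bottom (larger y is higher on the page).
    for g in sorted(glyphs, key=lambda g: -g.y):
        if lines and abs(lines[-1][0].y - g.y) <= line_tolerance:
            lines[-1].append(g)  # same baseline, same line
        else:
            lines.append([g])    # start a new line
    return [
        " ".join(g.text for g in sorted(line, key=lambda g: g.x))
        for line in lines
    ]


# Reading the runs in stored order scrambles the text.
naive = " ".join(g.text for g in PAINTED)

# Sorting by position recovers it, for this simple single-column page.
recovered = reading_order(PAINTED)

print(naive)
print(recovered)
```

Even this toy version has to guess where one line ends and the next begins. Multi-column layouts, tables, footnotes, and scanned pages with no text layer at all break the simple sort, which is part of why extraction remains hard.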
