At Morphik, we build RAG tools that give developers accurate search over complex documents. In this article, we explain why we operate over "images" of pages instead of doing OCR/parsing.
If you’ve ever tried to extract information from a complex PDF, one with charts, diagrams, and tables mixed with text, you know the pain. That invoice with a nested table showing quarterly breakdowns? The research paper whose intricate figures actually contain the key findings? The technical manual where the annotated diagrams explain more than the text ever could? Or maybe the IKEA manual with no text at all.
We’ve all been there, watching our carefully crafted parsing pipeline mangle yet another document. The industry’s dirty secret is that we’re spending enormous effort (and money) on OCR, layout detection, and parsing pipelines that still lose the information that matters most. It’s like trying to “watch” a movie by reading its script: you miss all the visual storytelling that makes it meaningful.
For example, let’s take a very simple, all-text page (not scanned, no diagrams) like the one below:
Fig 1: Simple page showing Palantir financials
If I try to parse it with common OCR tools, the values might come out correct, but the headings and values get jumbled together. Add standard chunking on top, and we might not send the correct information at retrieval time.
Q1
Financials
US commercial continues to accelerate in Q1 2025 alongside AIP revolution
+71% Y/Y
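To make the chunking problem concrete, here is a minimal sketch that treats the four lines above as the extracted text and runs them through a naive, greedy character-budget chunker. The `chunk_lines` helper and the 80-character budget are assumptions for illustration only; they are not Morphik's pipeline or any particular library's splitter.

```python
def chunk_lines(lines, max_chars):
    """Greedily pack whole lines into chunks of at most max_chars characters.

    Hypothetical helper for illustration; real splitters differ in detail
    but can separate a heading from its value in much the same way.
    """
    chunks, current = [], ""
    for line in lines:
        candidate = f"{current}\n{line}" if current else line
        if current and len(candidate) > max_chars:
            # Budget exceeded: close the current chunk and start a new one.
            chunks.append(current)
            current = line
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# The extracted lines from the page above, in the order they came out.
ocr_lines = [
    "Q1",
    "Financials",
    "US commercial continues to accelerate in Q1 2025 alongside AIP revolution",
    "+71% Y/Y",
]

# 80 characters is an assumed, deliberately small budget to expose the issue.
chunks = chunk_lines(ocr_lines, max_chars=80)

for i, chunk in enumerate(chunks):
    print(f"--- chunk {i} ---")
    print(chunk)
```

With this budget, the "+71% Y/Y" figure ends up in a chunk of its own, cut off from the heading that says what grew 71%. A retriever that surfaces only one of those chunks returns either a claim without its number or a number without its meaning.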