Most enterprise RAG pipelines start the same way: a text parser converts web pages and documents into plain text so they can be chunked and indexed for retrieval. That conversion step destroys retrieval signals and, according to new research, it's responsible for the majority of wrong answers.

A research team from UC Berkeley, Princeton University, EPFL and Databricks published a paper this week introducing PixelRAG, a system that skips that conversion entirely. Instead of parsing pages into text, PixelRAG renders them as screenshots, indexes those images and feeds retrieved tiles directly to a vision-language model (VLM) reader. Tested across 30 million screenshot tiles covering all of Wikipedia, it outperforms text-based RAG across six benchmarks, improving accuracy by up to 18.1% over text-based baselines.

Parsers are the wrong place to look for fixes, according to the research team.

"Improving parsers is an endless process because every website requires special handling," Yichuan Wang, lead author and UC Berkeley doctoral student, told VentureBeat. "Our goal was to explore whether recent advances in VLMs make it possible to bypass that entire problem and build a retrieval system that works across websites without site-specific engineering."

HTML parsers destroy the retrieval signals that enterprise RAG depends on

The researchers' goal was to develop a clean end-to-end architecture.

"Modern web RAG pipelines often involve rendering, parsing, cleaning, chunking, and many other handcrafted stages," Wang said. "Every stage introduces potential cascade errors and abstractions that move us further away from the original webpage. We were interested in whether we could eliminate most of that complexity and operate directly on the rendered page."

Wang also noted that parsing inevitably loses information. Images, visual hierarchy, typography, emphasis (e.g., bold text), tables, and layout are either discarded or converted into imperfect textual approximations. "No matter how good a parser becomes, some information is fundamentally lost during the conversion," he said.

The research identifies three ways text-based RAG loses the answer. All three were measured on SimpleQA, a standard benchmark of 1,000 factual Wikipedia questions:

Parser loss (36.6% of failures). HTML-to-text conversion destroys structured content so completely that no text chunk in the corpus contains the answer.

Rank loss (55.2% of failures). The answer exists in the corpus but gets outranked by keyword-dense infoboxes that land at rank 1 for 75.9% of queries, pushing answer-bearing paragraphs to rank 20 or lower.

Reader loss (8.2% of failures). The correct content reaches the reader but flattened structure causes misattribution.

How PixelRAG works

Unlike a standard LLM that reads only text, a vision-language model takes images as input alongside text, meaning it can read a rendered web page the way a human does, with layout and structure intact. "For many structured information extraction tasks, we believe modern VLMs have an inherent advantage because they can reason jointly over both content and layout rather than relying on a flattened text representation," Wang said.

PixelRAG is built around that principle, replacing the text parsing pipeline with a four-stage system that operates entirely on rendered screenshots.

Rendering. Pages are rendered using Playwright, a browser automation library, at a fixed 875-pixel viewport and sliced into 1024-pixel-tall tiles. Wikipedia's 7 million articles produce roughly 30 million tiles. Assets are cached locally and rendered entirely offline.

Indexing. Each tile is encoded as a single 2048-dimensional vector using Qwen3-VL-Embedding-2B and stored in a FAISS approximate nearest-neighbor index. The full index runs to approximately 120 GB in fp16 and supports incremental updates without full re-indexing.

Training.
The retrieval model is fine-tuned on synthetic contrastive data generated from the datastore, using dynamic hard-negative mining to filter false negatives. LoRA, a lightweight fine-tuning method that updates a small fraction of model weights, is applied to both the language model backbone and the visual encoder. Training on approximately 40,000 pairs completes in under three hours on a single H100.

Storage. Raw screenshot tiles for Wikipedia require 5.6 TB, but a render-on-demand approach eliminates the need to store them persistently: embed all tiles, delete the screenshots and re-render pages on demand at query time. The vector index requires approximately 120 GB.

Six benchmarks, 10x agent token savings and one unsolved problem

Researchers tested PixelRAG across six benchmarks spanning factual Wikipedia QA, table-based queries, multimodal QA and live news retrieval. They said it outperformed text-based RAG on all six, including on tasks where questions are answerable from text alone. On SimpleQA it reaches 78.8% accuracy versus 71.6% for the strongest text parser. On structured table queries it scores 48.8% versus 42.5%, a larger gap in relative terms. Teams need Qwen3-VL-4B class models or above to see the benefit. Smaller models trail text retrieval by more than 12.5 percentage points.

The agent cost advantage is the strongest near-term case for PixelRAG. In benchmark testing, an AI agent using PixelRAG as its search backend ran on 3.6 million prompt tokens versus 37.5 million for text retrieval, at 2 to 4 times lower cost than alternatives including Google, while achieving higher accuracy. Image compression can cut that token budget by a further third.

Visual chunking is the main unsolved problem. Text-based RAG systems have spent years refining how to split documents into meaningful retrieval units based on topic, section or semantic content.
PixelRAG currently has no equivalent: it slices pages by fixed pixel height, meaning a table or paragraph can be split across two tiles with no awareness of content boundaries. "The text retrieval community has spent years studying chunking strategies, while visual retrieval has received much less attention," Wang said. "We think this is an important area for future research."

What this means for enterprises

The retrieval quality problem PixelRAG addresses reflects a broader market shift already underway. VB Pulse Q1 2026 data from qualified enterprise respondents found intent to adopt hybrid retrieval tripling from 10.3% in January to 33.3% in March, the fastest-growing strategic position in the dataset. PixelRAG's own authors point to hybrid deployment as the most practical near-term path: layering visual retrieval on top of existing text systems rather than replacing them.

For teams already running RAG pipelines, the path to those savings is more straightforward than a ground-up rebuild.

"A practical path is to use PixelRAG as an enhancement layer alongside existing text retrieval systems," Wang said. "Hybrid retrieval that combines both text and visual search is straightforward and is likely how many production deployments would evolve."
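
For readers who want to check the pipeline figures reported above, here is a minimal Python sketch. It is not the authors' code. It shows how fixed-height slicing cuts a rendered page into tiles, and how the reported index size follows from the tile count, vector width and fp16 precision. The function names `tile_bounds` and `index_size_bytes` are hypothetical, and the 2,500-pixel page height is an illustrative value, not a figure from the paper.

```python
def tile_bounds(page_height: int, tile_height: int = 1024) -> list[tuple[int, int]]:
    """Return (top, bottom) pixel offsets of fixed-height tiles covering a page.

    Hypothetical helper: the slicing is purely geometric, so a tile edge
    can land in the middle of a table or paragraph.
    """
    if page_height <= 0 or tile_height <= 0:
        raise ValueError("page_height and tile_height must be positive")
    return [
        (top, min(top + tile_height, page_height))
        for top in range(0, page_height, tile_height)
    ]


def index_size_bytes(num_vectors: int, dim: int, bytes_per_value: int = 2) -> int:
    """Raw size of a dense vector store; fp16 uses 2 bytes per value.

    Ignores any overhead added by the nearest-neighbor index structure.
    """
    return num_vectors * dim * bytes_per_value


if __name__ == "__main__":
    # Illustrative 2,500-pixel-tall page: two full tiles and one partial tile.
    print(tile_bounds(2500))

    # Figures reported in the article: ~30 million tiles, 2048-dim vectors, fp16.
    print(index_size_bytes(30_000_000, 2048) / 1e9, "GB")
```

Under these assumptions the raw vectors come to about 122.9 GB, consistent with the roughly 120 GB reported for the index. The tile boundaries also make the visual chunking problem concrete: the cut points depend only on pixel offsets, never on where a table or paragraph starts or ends.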
PixelRAG beats text parsers on accuracy and cuts AI agent token costs 10x