Databricks' OfficeQA uncovers disconnect: AI agents ace abstract tests but stall at 45% on enterprise docs

There is no shortage of AI benchmarks on the market today, with popular options like Humanity's Last Exam (HLE), ARC-AGI-2 and GDPval, among numerous others. AI agents excel at solving the abstract math problems and passing the PhD-level exams that most benchmarks are based on, but Databricks has a question for the enterprise: Can they actually handle the document-heavy work most enterprises need them to do?

The answer, according to new research from the data and AI platform company, is sobering. Even the best-performing AI agents achieve less than 45% accuracy on tasks that mirror real enterprise workloads, exposing a critical gap between academic benchmarks and business reality.

"If we focus our research efforts on getting better at [existing benchmarks], then we're probably not solving the right problems to make Databricks a better platform," Erich Elsen, principal research scientist at Databricks, explained to VentureBeat. "So that's why we were looking around. How do we create a benchmark that, if we get better at it, we're actually getting better at solving the problems that our customers have?"

The result is OfficeQA, a benchmark designed to test AI agents on grounded reasoning: answering questions based on complex proprietary datasets containing unstructured documents and tabular data. Unlike existing benchmarks that focus on abstract capabilities, OfficeQA serves as a proxy for the economically valuable tasks enterprises actually perform.

Why academic benchmarks miss the enterprise mark

Popular AI benchmarks have numerous shortcomings from an enterprise perspective, according to Elsen. HLE features questions requiring PhD-level expertise across diverse fields. ARC-AGI evaluates abstract reasoning through visual manipulation of colored grids. Both push the frontiers of AI capabilities, but neither reflects daily enterprise work. Even GDPval, which was specifically created to evaluate economically useful tasks, misses the target.

"We come from a pretty heavy science or engineering background, and sometimes we create evals that reflect that," Elsen said. "So they're either extremely math-heavy, which is a great, useful task, but advancing the frontiers of human mathematics is not what customers are trying to do with Databricks."

While AI is commonly used for customer support and coding apps, Databricks' customer base has a broader set of requirements. Elsen noted that answering questions about documents, or corpora of documents, is a common enterprise task. Such tasks require parsing complex tables with nested headers, retrieving information across dozens or hundreds of documents and performing calculations where a single-digit error can cascade into incorrect business decisions.

Building a benchmark that mirrors enterprise document complexity

To create a meaningful test of grounded reasoning capabilities, Databricks needed a dataset that approximates the messy reality of proprietary enterprise document corpora while remaining freely available for research. The team landed on U.S. Treasury Bulletins, published monthly for five decades beginning in 1939 and quarterly thereafter.

The Treasury Bulletins check every box for enterprise document complexity. Each bulletin runs 100 to 200 pages and consists of prose, complex tables, charts and figures describing Treasury operations: where federal money came from, where it went and how it financed government operations. The corpus spans approximately 89,000 pages across eight decades.
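To make the grounded-reasoning task format concrete, here is a minimal sketch of what a question over a corpus like this might look like once paired with a verifiable answer. The class, field names and example values are illustrative assumptions, not the released OfficeQA schema.

```python
from dataclasses import dataclass

@dataclass
class GroundedQAItem:
    """Hypothetical shape of a document-grounded QA item; not the official OfficeQA schema."""
    question: str                 # natural-language question over the corpus
    source_documents: list[str]   # bulletins or pages the answer must be grounded in
    ground_truth: str             # verifiable answer: usually a number, sometimes a date or short list

# Illustrative item in the spirit of the Treasury Bulletin corpus (paths and wording are made up).
example = GroundedQAItem(
    question="By what percentage did total public debt outstanding change between the 1975 "
             "and 1976 year-end bulletins, adjusted for inflation?",
    source_documents=["treasury_bulletin_1975-12.pdf", "treasury_bulletin_1976-12.pdf"],
    ground_truth="<validated numeric answer>",  # placeholder; real items carry a verified value
)
```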
Until 1996, the bulletins were scans of physical documents; afterward, they were digitally produced PDFs. USAFacts, an organization whose mission is "to make government data easier to access and understand," partnered with Databricks to develop the benchmark, identifying the Treasury Bulletins as an ideal corpus and ensuring the questions reflected realistic use cases.

The 246 questions require agents to handle messy, real-world document challenges: scanned images, hierarchical table structures, temporal data spanning multiple reports and the need for external knowledge like inflation adjustments. Questions range from simple value lookups to multi-step analysis requiring statistical calculations and cross-year comparisons.

To ensure the benchmark requires actual document-grounded retrieval, Databricks filtered out questions that LLMs could answer using parametric knowledge or web search alone. This removed simpler questions and some surprisingly complex ones where models leveraged historical financial records memorized during pre-training.

Every question has a validated ground-truth answer (typically a number, sometimes dates or small lists), enabling automated evaluation without human judging. This design choice matters: It allows reinforcement learning (RL) approaches that require verifiable rewards, similar to how models train on coding problems.

Current performance exposes fundamental gaps

Databricks tested Claude Opus 4.5 Agent (using Claude's SDK) and GPT-5.1 Agent (using OpenAI's File Search API). The results should give pause to any enterprise betting heavily on current agent capabilities.

When provided with raw PDF documents:

Claude Opus 4.5 Agent (with default thinking=high) achieved 37.4% accuracy.

GPT-5.1 Agent (with reasoning_effort=high) achieved 43.5% accuracy.

However, performance improved noticeably when the agents were given pre-parsed versions of pages produced with Databricks' ai_parse_document, indicating that the poor raw-PDF performance stems from LLM APIs struggling with parsing rather than reasoning. Even with parsed documents, the experiments show room for improvement.

When provided with documents parsed using Databricks' ai_parse_document:

Claude Opus 4.5 Agent achieved 67.8% accuracy (a +30.4 percentage point improvement).

GPT-5.1 Agent achieved 52.8% accuracy (a +9.3 percentage point improvement).

Three findings that matter for enterprise deployments

The testing identified critical insights for practitioners:

Parsing remains the fundamental blocker: Complex tables with nested headers, merged cells and unusual formatting frequently produce misaligned values. Even when given the exact oracle pages, agents struggled primarily because of parsing errors, although performance roughly doubled with pre-parsed documents.

Document versioning creates ambiguity: Financial and regulatory documents get revised and reissued, meaning multiple valid answers exist depending on the publication date. Agents often stop searching once they find a plausible answer, missing more authoritative sources.

Visual reasoning is a gap: About 3% of questions require chart or graph interpretation, where current agents consistently fail. For enterprises where data visualizations communicate critical insights, this represents a meaningful capability limitation.

How enterprises can use OfficeQA

The benchmark's design enables specific improvement paths beyond simple scoring. "Since you're able to look at the right answer, it's easy to tell if the error is coming from parsing," Elsen explained.
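Because the ground-truth answers are short, verifiable values, scoring can be a simple programmatic check rather than an LLM judge. The sketch below is a minimal, hypothetical scorer in that spirit; the normalization rules and numeric tolerance are illustrative assumptions, not OfficeQA's published grader.

```python
def normalize(value: str) -> str:
    """Light normalization for short answers: trim, lowercase, strip commas and common symbols."""
    return value.strip().lower().replace(",", "").replace("$", "").rstrip("%")

def is_correct(predicted: str, ground_truth: str, rel_tol: float = 1e-4) -> bool:
    """Compare numeric answers within a small relative tolerance; fall back to exact string match."""
    p, g = normalize(predicted), normalize(ground_truth)
    try:
        return abs(float(p) - float(g)) <= rel_tol * max(1.0, abs(float(g)))
    except ValueError:
        return p == g  # dates or small lists fall back to exact string comparison

def accuracy(predictions: list[str], ground_truths: list[str]) -> float:
    """Fraction of questions answered correctly, i.e. the headline benchmark score."""
    correct = sum(is_correct(p, g) for p, g in zip(predictions, ground_truths))
    return correct / len(ground_truths)
```

The relative tolerance here is an illustrative choice; a stricter grader might require exact matches for counts and dates.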
This automated evaluation enables rapid iteration on parsing pipelines. The verified ground-truth answers also enable RL training similar to coding benchmarks, since no human judgment is required.

Elsen said the benchmark provides "a really strong feedback signal" for developers working on search solutions. However, he cautioned against treating it as training data.

"At least in my imagination, the goal of releasing this is more as an eval and not as a source of raw training data," he said. "If you tune too specifically into this environment, then it's not clear how generalizable your agent results would be."

What this means for enterprise AI deployments

For enterprises currently deploying or planning document-heavy AI agent systems, OfficeQA provides a sobering reality check. Even the latest frontier models achieve only about 43% accuracy on unprocessed PDFs and fall short of 70% accuracy even with optimal document parsing. Performance on the hardest questions plateaus at 40%, indicating substantial room for improvement.

Three immediate implications:

Evaluate your document complexity: If your documents resemble the complexity profile of the Treasury Bulletins (scanned images, nested table structures, cross-document references), expect accuracy well below vendor marketing claims. Test on your actual documents before production deployment (see the sketch at the end of this article).

Plan for the parsing bottleneck: The test results indicate that parsing remains a fundamental blocker. Budget time and resources for custom parsing solutions rather than assuming off-the-shelf OCR will suffice.

Plan for hard question failure modes: Even with optimal parsing, agents plateau at 40% on complex multi-step questions. For mission-critical document workflows that require multi-document analysis, statistical calculations or visual reasoning, current agent capabilities may not be ready without significant human oversight.

For enterprises looking to lead in AI-powered document intelligence, this benchmark provides a concrete evaluation framework and identifies specific capability gaps that need solving.
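As a rough illustration of the first two implications above, the sketch below runs the same in-house question set through an agent twice, once over raw PDFs and once over pre-parsed text (for example, output from a parsing step such as Databricks' ai_parse_document), and compares accuracy. The ask_agent callable, the item dictionary layout and the keys used here are hypothetical stand-ins, not any vendor's API.

```python
from typing import Callable, Sequence

# Hypothetical agent interface: takes a question plus document paths, returns a short answer string.
# Swap in your actual agent call (Claude SDK, OpenAI file search, etc.); this alias is a placeholder.
AskAgent = Callable[[str, Sequence[str]], str]

def exact_or_close(pred: str, truth: str) -> bool:
    """Minimal correctness check; see the scorer sketch earlier in the article for a fuller version."""
    try:
        return abs(float(pred) - float(truth)) <= 1e-4 * max(1.0, abs(float(truth)))
    except ValueError:
        return pred.strip().lower() == truth.strip().lower()

def run_condition(ask_agent: AskAgent, items: list[dict], doc_key: str) -> float:
    """Score one condition: doc_key selects the 'raw_pdfs' or 'parsed_text' paths in each item."""
    correct = sum(
        exact_or_close(ask_agent(item["question"], item[doc_key]), item["ground_truth"])
        for item in items
    )
    return correct / len(items)

def compare_parsing_conditions(ask_agent: AskAgent, items: list[dict]) -> None:
    """Report accuracy on raw PDFs versus pre-parsed text for the same question set."""
    raw = run_condition(ask_agent, items, "raw_pdfs")
    parsed = run_condition(ask_agent, items, "parsed_text")
    print(f"raw PDFs:    {raw:.1%}")
    print(f"parsed text: {parsed:.1%}")
    print(f"delta:       {(parsed - raw) * 100:+.1f} percentage points")
```

If the delta on your own documents resembles the 9-to-30 percentage point gaps Databricks reports, the parsing pipeline, rather than the model, is likely the first place to invest.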