
Computational complexity of schema-guided document extraction


When we started building Pulse, we assumed that structured outputs had solved the document extraction problem. Define a JSON schema, point an LLM at your documents, and get clean data back. Every major API provider now supports it. OpenAI, Anthropic, Google, and a growing ecosystem of open source tools like Outlines and XGrammar have made constrained generation accessible to anyone with an API key.

So our research team ran some internal evals: we started with a relatively simple document layout, bank statements and invoices, defined a schema, and watched it work. Then we tried it on ten thousand more complex documents with alternate currency markers, faxed documents with unruled tables, nested line items, conditional fields, and variable-length arrays. That's when things got interesting.

The short version: enforcing structured outputs during generation is not free, and for the kinds of complex, nested schemas that real document extraction requires, naive approaches either blow up in compute or silently degrade extraction quality.

How Constrained Decoding Actually Works

At each generation step, the model produces a probability distribution over its entire vocabulary (128,000+ tokens in OSS models like Llama). Constrained decoding works by masking out tokens that would violate the schema, setting their probabilities to zero before sampling. The model can only generate tokens that keep the output on a valid path through the grammar.
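The masking step can be sketched in a few lines. This is a minimal illustration, not any provider's actual implementation: `valid_token_ids` stands in for whatever set the grammar engine reports as legal at the current step.

```python
import numpy as np

def constrained_sample(logits, valid_token_ids, rng=None):
    """Sample the next token, allowing only grammar-valid tokens.

    `logits` holds the model's raw score for every vocabulary entry;
    `valid_token_ids` is the set of legal next tokens (hypothetical
    input here -- in practice it comes from the grammar engine).
    """
    rng = rng or np.random.default_rng()
    mask = np.full_like(logits, -np.inf)
    mask[valid_token_ids] = 0.0      # keep valid tokens, forbid the rest
    masked = logits + mask           # invalid tokens end up at -inf
    probs = np.exp(masked - masked.max())
    probs /= probs.sum()             # softmax over the valid tokens only
    return int(rng.choice(len(logits), p=probs))
```

Setting the masked logits to negative infinity before the softmax is equivalent to zeroing the probabilities: invalid tokens can never be sampled, regardless of how much the model prefers them.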

For simple constraints like regular expressions, this is efficient. The Outlines library showed you can precompute a finite state machine (FSM) from a regex and use it to generate token masks in O(1) time per step. For regular languages, the set of valid next tokens depends only on the current FSM state, not on the history of how you got there.
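A toy version of that idea, shrunk to a four-"token" vocabulary and the regex `[ab]*c`, shows why the per-step lookup is O(1): the masks are computed once per FSM state, ahead of time. (The state table and vocabulary here are invented for illustration; Outlines compiles these from the real regex against the model's full tokenizer.)

```python
# Toy FSM for the regex [ab]*c over a tiny character vocabulary.
VOCAB = ["a", "b", "c", "d"]

# transition[state][token] -> next state; missing entries are invalid.
TRANSITIONS = {
    0: {"a": 0, "b": 0, "c": 1},  # state 0: matching [ab]*, 'c' finishes
    1: {},                        # state 1: accepting, nothing more allowed
}

# Precompute one boolean mask per state -- the O(1)-per-step lookup.
MASKS = {
    state: [tok in edges for tok in VOCAB]
    for state, edges in TRANSITIONS.items()
}

def valid_tokens(state):
    """Look up the precomputed mask for the current FSM state."""
    return [tok for tok, ok in zip(VOCAB, MASKS[state]) if ok]
```

At generation time the only work per step is indexing `MASKS[state]` and advancing the state with `TRANSITIONS`, independent of how long the output already is.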

While a specific JSON schema can, in principle, be flattened into a regular language (if you bound the nesting depth), the recursive structure of nested objects and arrays makes it behave like a context-free grammar in terms of parser complexity: deciding which tokens are valid next requires tracking the stack of unclosed brackets, which calls for a pushdown automaton (PDA) rather than a finite state machine.
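A small sketch makes the stack dependence concrete. Whether `]` or `}` is valid after a given prefix depends on the innermost unclosed bracket, information an FSM with finitely many states cannot hold at arbitrary nesting depth. (This toy checker ignores brackets inside string literals; a real JSON grammar engine handles those too.)

```python
def valid_closer(prefix):
    """Return which closing bracket is valid after `prefix`, or None.

    A toy pushdown check: the answer depends on the whole stack of
    open brackets, not just the last character -- the context-free part.
    """
    pairs = {"{": "}", "[": "]"}
    stack = []
    for ch in prefix:
        if ch in pairs:
            stack.append(pairs[ch])            # push the expected closer
        elif ch in pairs.values():
            if not stack or stack[-1] != ch:
                raise ValueError(f"unbalanced at {ch!r}")
            stack.pop()                        # matched, unwind one level
    return stack[-1] if stack else None
```

For the prefix `{"items": [[`, only `]` is a valid closer; for `{` alone, only `}` is. The stack is exactly what distinguishes the two, and it can grow without bound as documents nest more deeply.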

The State Explosion Problem

The XGrammar paper introduced a clever optimization. Although you can't precompute masks for every combination of PDA position and stack contents, you can partition the vocabulary into two sets:

Context-independent tokens: validity can be determined from the current position in the PDA alone, ignoring the stack. For most grammars, this covers the vast majority of the vocabulary.

Context-dependent tokens: validity depends on the stack contents, so their masks must be checked at runtime.
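A toy version of the partition: for a bracket grammar, a plain character like `x` is valid no matter what is on the stack, while `}` and `]` are valid only when they match the stack top. Enumerating stacks as done below is only for illustration; XGrammar determines context independence analytically from the grammar, since the set of possible stacks is unbounded.

```python
def is_valid(token, stack):
    """Toy validity rule for a bracket grammar at one PDA position."""
    if token in ("{", "["):           # opening brackets: always allowed
        return True
    if token in ("}", "]"):           # closing brackets: must match stack top
        return bool(stack) and {"}": "{", "]": "["}[token] == stack[-1]
    return True                       # ordinary tokens: allowed regardless

VOCAB = ["{", "[", "}", "]", "x"]
SAMPLE_STACKS = [[], ["{"], ["["], ["{", "["]]  # illustrative sample only

# A token is context-independent if its validity never varies with the stack.
context_independent = [
    tok for tok in VOCAB
    if len({is_valid(tok, s) for s in SAMPLE_STACKS}) == 1
]
# '{', '[', and 'x' never change validity; '}' and ']' depend on the stack.
```

The payoff is that masks for the context-independent set, typically most of the vocabulary, can be precomputed once, leaving only the small context-dependent remainder to be checked against the stack at each step.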
