Introducing the Structured Output Benchmark (SOB)
LLMs are increasingly deployed to produce structured data from unstructured and semi-structured sources: parsing invoices, medical records, and meeting transcripts, and converting PDFs into database rows.
In these workflows, output must behave deterministically: the next step reads a specific key and expects a specific type. A hallucinated invoice_total, or an array ordered incorrectly because of inaccurate date values, silently breaks downstream systems. Yet existing benchmarks either check schema compliance alone or evaluate value correctness within a single source domain.
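To make the failure mode concrete, here is a minimal sketch of a downstream consumer; the invoice_total field and the amounts are illustrative, not taken from SOB's dataset:

```python
import json

def post_invoice(llm_output: str, expected_total: float) -> None:
    record = json.loads(llm_output)    # parse succeeds
    total = record["invoice_total"]    # expected key is present
    assert isinstance(total, float)    # expected type checks out
    if total != expected_total:
        # Every structural check passed, so nothing upstream flags this;
        # the wrong number flows straight into the next system.
        print(f"silent value error: got {total}, source said {expected_total}")

# Valid JSON, correct key, correct type, wrong value.
post_invoice('{"invoice_total": 1849.00}', expected_total=1489.00)
```

A schema validator gives this response full marks; only a value-level check against the source document catches the error.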
Top 5 at a glance
A side-by-side look at the top 5 models across all seven metrics. The structural metrics (JSON Pass, Path Recall, Structure Coverage, Type Safety) cluster near the ceiling for every model, while Value Accuracy and Perfect Response separate them.
The problem with current structured output benchmarks
Most benchmarks collapse "structured output quality" into a single number: does the response parse, and does it validate against the schema? That's necessary, not sufficient.
| Problem in current benchmarks | What it misses |
| --- | --- |
| Schema compliance as the only metric | A model can emit perfectly valid JSON with wrong values and score 100% |
| Single-source inputs (text only) | Real systems extract from OCR, screenshots, meeting audio, and PDFs, not just clean text |
| No difficulty weighting | Medium and hard schemas are scored identically, hiding which models actually handle nested structure |
| No separation of parse / structure / value errors | You can't tell if a model failed at JSON, at the schema, or at the facts |
| Reasoning / chain-of-thought blended in | Results measure reasoning + extraction together, not the extraction capability itself |
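The last two rows suggest what a layered evaluation could look like: parse, structure, and value scored independently rather than collapsed into one pass/fail. This is a minimal sketch under assumptions of ours; the score and _lookup helpers and the dotted-path gold format are hypothetical, not SOB's actual scoring code:

```python
import json

_MISSING = object()

def _lookup(obj, dotted_path: str):
    """Walk a dotted path like 'invoice.total' through nested dicts."""
    for key in dotted_path.split("."):
        if not isinstance(obj, dict) or key not in obj:
            return _MISSING
        obj = obj[key]
    return obj

def score(response: str, schema_paths: set, gold: dict) -> dict:
    """Score one response at three separate layers: parse, structure, value."""
    result = {"parse": False, "structure": 0.0, "value": 0.0}
    try:
        obj = json.loads(response)
    except json.JSONDecodeError:
        return result  # parse failure: the other layers never run
    result["parse"] = True

    # Structure layer: fraction of expected paths that are present (path recall).
    present = {p for p in schema_paths if _lookup(obj, p) is not _MISSING}
    result["structure"] = len(present) / len(schema_paths)

    # Value layer: fraction of expected paths whose values match the gold answer.
    correct = sum(1 for p in present if _lookup(obj, p) == gold[p])
    result["value"] = correct / len(schema_paths)
    return result

# A response that parses and matches the schema but gets one value wrong
# scores parse=True, structure=1.0, value=0.5 instead of a flat pass/fail.
print(score('{"invoice": {"total": 1849.0, "currency": "EUR"}}',
            {"invoice.total", "invoice.currency"},
            {"invoice.total": 1489.0, "invoice.currency": "EUR"}))
```

Keeping the three numbers separate is what lets you tell a model that can't emit JSON apart from one that emits perfect JSON full of wrong facts.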
References to existing benchmarks: JSONSchemaBench, StructEval, DeepJSONEval, LLMStructBench
How SOB works