
Show HN: A new benchmark for testing LLMs for deterministic outputs

Why This Matters

The Structured Output Benchmark (SOB) introduces a more comprehensive way to evaluate large language models' ability to produce accurate, structured data outputs, which is critical for downstream system reliability. By addressing limitations of existing benchmarks, SOB helps ensure models can handle complex, real-world data extraction tasks with greater precision, benefiting both developers and end-users in industries relying on structured data. This advancement promotes the development of more dependable LLMs for enterprise applications and automation workflows.

Key Takeaways

Introducing the Structured Output Benchmark (SOB)

LLMs are increasingly deployed to produce structured data from unstructured and semi-structured sources: parsing invoices, medical records, and meeting transcripts, and converting PDFs to database rows.

When output must be deterministic, the next step in a workflow reads a specific key and expects a specific type. A hallucinated invoice_total, or an array ordered incorrectly because of inaccurate date values, silently breaks downstream systems. Yet existing benchmarks either check schema compliance alone or evaluate value correctness within a single source domain.
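To illustrate why schema-level checks alone are not enough, here is a minimal sketch of a hypothetical downstream consumer (the function name and key are assumptions, not from the benchmark): it catches parse, key, and type failures loudly, but a hallucinated value passes every check silently.

```python
import json

def read_invoice_total(llm_response: str) -> float:
    """Hypothetical downstream step: reads one specific key, expects one type."""
    record = json.loads(llm_response)   # breaks loudly if the output isn't valid JSON
    total = record["invoice_total"]     # breaks loudly if the key is missing
    if not isinstance(total, (int, float)):
        raise TypeError("invoice_total must be numeric")
    return total

# Parse, key, and type checks all pass even when the value is hallucinated:
read_invoice_total('{"invoice_total": 42.0}')  # wrong number, no error raised
```

This is the failure mode the article calls "silent": nothing in the pipeline distinguishes a correct 199.99 from a hallucinated 42.0.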

Top 5 at a glance

A side-by-side look at the top 5 models across all seven metrics. The structural metrics (JSON Pass, Path Recall, Structure Coverage, Type Safety) cluster near the ceiling for every model, while Value Accuracy and Perfect Response separate them.

The problem with current structured output benchmarks

Most benchmarks collapse "structured output quality" into a single number: does the response parse, and does it validate against the schema? That's necessary, not sufficient.

| Problem in current benchmarks | What it misses |
| --- | --- |
| Schema compliance as the only metric | A model can emit perfectly valid JSON with wrong values and score 100% |
| Single-source inputs (text only) | Real systems extract from OCR, screenshots, meeting audio, and PDFs, not just clean text |
| No difficulty weighting | Medium and hard schemas are scored identically, hiding which models actually handle nested structure |
| No separation of parse / structure / value errors | You can't tell if a model failed at JSON, at the schema, or at the facts |
| Reasoning / chain-of-thought blended in | Results measure reasoning + extraction together, not the extraction capability itself |
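The "no separation of parse / structure / value errors" row can be made concrete with a small triage sketch (the function, schema shape, and gold values below are illustrative assumptions, not SOB's actual scoring code): each response is bucketed by the first layer at which it fails.

```python
import json

def classify_failure(response: str, schema_keys: dict, truth: dict) -> str:
    """Hypothetical triage: did the model fail at JSON, at the schema, or at the facts?

    schema_keys maps each required key to its expected Python type;
    truth holds the gold values for the value-accuracy check.
    """
    try:
        record = json.loads(response)
    except json.JSONDecodeError:
        return "parse error"                       # not valid JSON at all
    for key, expected_type in schema_keys.items():
        if key not in record or not isinstance(record[key], expected_type):
            return "structure error"               # valid JSON, wrong shape
    if any(record[k] != v for k, v in truth.items()):
        return "value error"                       # right shape, wrong facts
    return "perfect response"

schema = {"invoice_total": float, "currency": str}
gold = {"invoice_total": 199.99, "currency": "EUR"}

classify_failure('{"invoice_total": 199.99', schema, gold)        # parse error
classify_failure('{"invoice_total": "199.99"}', schema, gold)     # structure error
classify_failure('{"invoice_total": 42.0, "currency": "EUR"}', schema, gold)  # value error
```

Separating the buckets this way is what lets a benchmark report structural metrics and value accuracy as distinct numbers rather than one collapsed score.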

References to existing benchmarks: JSONSchemaBench | StructEval | DeepJSONEval | LLMStructBench

How SOB works

... continue reading