Subquadratic – Introducing SubQ 1.1 Small

Date June 16, 2026

The hardest enterprise AI problems share a common shape: they require reasoning over complete artifacts such as entire codebases, document collections, contracts, and financial filings. For years, the industry worked around this by building retrieval pipelines, chunking strategies, and agentic scaffolding. These are useful tools, but ultimately workarounds for the context limitations of the model architecture. The underlying constraint was attention: its compute scales quadratically with context length, which makes direct reasoning over large artifacts prohibitively expensive.

SubQ is built to remove that constraint. Today we're releasing the model card for SubQ 1.1 Small, the second iteration of our Subquadratic Sparse Attention (SSA) model, at the smallest size. We are deploying SubQ 1.1 Small with select design partners and plan to deploy a broader lineup of models, with context lengths ranging from 2M to 12M tokens, later in the year.

Key Features

Near-perfect long-context retrieval up to 12M tokens on the needle-in-a-haystack test, with up to a nearly 1,000x reduction in attention compute.

A balance of long-context optimization and general reasoning ability, with strong performance retained across knowledge, coding, and non-coding enterprise agent benchmarks.

At 1M tokens, SubQ 1.1 Small requires 64.5x less compute than dense attention and runs 56x faster than FlashAttention-2. These results reflect the scaling advantage that SSA's efficiency gains make possible.

Benchmarks

SubQ 1.1 Small was evaluated across five axes: long-context retrieval, context-length generalization, knowledge, coding, and long-horizon agentic tasks.

Long-Context Retrieval & Generalization

We selected Needle-In-A-Haystack (NIAH) and NVIDIA's RULER test because together they test whether the model can find a single fact buried deep in a large context, and whether it can connect the dots across that context.

NIAH is the precision test. It places one retrievable fact at a controlled depth within a long context and asks the model to return it exactly. SubQ 1.1 Small scores near-perfect at 1M, 2M, 6M, and 12M tokens. The model was trained predominantly at 1M tokens, yet retrieval held near-perfectly at 12x that length, despite compressing attention to just 0.13% of relationships. This generalization is a direct consequence of SSA routing attention based on content relevance rather than fixed positional patterns.

RULER is the capability test. Its 13 tasks go beyond single-fact lookup to cover multi-hop variable tracing, frequency extraction, and aggregation across the full context, the kind of reasoning that complete-artifact workloads actually require. SubQ 1.1 Small scores 99.12% at 128K.

Test                                                   Context   Score
Multi-task retrieval: RULER (128K)                     128K      99.12%
Single-fact retrieval: Needle-in-a-haystack (1M–12M)   1M        100%
                                                       2M        100%
                                                       6M        98%
                                                       12M       98%
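To make the NIAH setup concrete, here is a minimal Python sketch of how a probe of this kind can be constructed and scored. It is an illustration only: the helper names (`build_niah_prompt`, `score_exact`), the filler text, and the needle are hypothetical and are not taken from the SubQ evaluation harness.

```python
def build_niah_prompt(filler_sentences, needle, depth):
    """Insert `needle` at a fractional depth of the haystack.

    depth = 0.0 places the needle at the very start, 1.0 at the very end.
    Returns the assembled context and the sentence index of the needle.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be in [0, 1]")
    position = round(depth * len(filler_sentences))
    haystack = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return " ".join(haystack), position


def score_exact(model_answer, expected):
    """Exact-match scoring after trimming whitespace: 1 if correct, else 0."""
    return int(model_answer.strip() == expected.strip())


# Hypothetical haystack and needle, for illustration only.
filler = [f"Filler sentence number {i}." for i in range(1000)]
needle = "The secret code is 48213."
prompt, position = build_niah_prompt(filler, needle, depth=0.25)
```

A real run would sweep `depth` and the context length, send `prompt` plus a question such as "What is the secret code?" to the model, and average `score_exact` over the trials.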

General Knowledge & Reasoning

SubQ 1.1 Small pairs long-context optimization with general reasoning ability. GPQA Diamond at 85.4% sits just below mid-tier frontier models and well above the smaller tier. LiveCodeBench at 89.7% pass@4 is close to the absolute frontier. AutomationBench Finance at 13% places SubQ 1.1 Small close to the strongest models on that benchmark and ahead of mid-tier and smaller baselines, although absolute scores remain low for every model on this benchmark.

Benchmark                                            SubQ 1.1 Small   GPT-5.5   Opus 4.8   Sonnet 4.6   GPT-5.4-mini   GPT-5.4-nano   Haiku 4.5
Graduate-level science (GPQA Diamond, pass@1)        85.4             93.2      92         87.5         87.5           81.7           67.2
Agentic finance (AutomationBench)                    13%              18%       16%        8%           0%             n/r            3%
Competitive programming (LiveCodeBench v6, pass@4)   89.7             92        92.2       88.9         78.6           78.2           69.7

n/r = result not reported by the model provider
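The pass@1 and pass@4 labels above belong to the pass@k family of metrics. This post does not say how the figures were computed. One widely used approach is the unbiased estimator that, given n sampled solutions of which c are correct, computes the probability that at least one of k samples drawn without replacement is correct. The sketch below assumes that estimator, and the function name `pass_at_k` is ours.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: total solutions sampled for the problem
    c: number of those solutions that pass the tests
    k: sample budget being scored (k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect solutions: every draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 10 samples, 3 of them correct.
p1 = pass_at_k(n=10, c=3, k=1)
p4 = pass_at_k(n=10, c=3, k=4)
```

A benchmark score is then the mean of this value over all problems. The example shows why pass@4 is always at least as high as pass@1 on the same samples, so scores reported at different k are not directly comparable.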

Efficiency

SSA replaces the O(n²) dense attention pass with a learned sparse formulation that scales linearly with context length, so its advantage over dense attention grows as context length increases. At 1M tokens, SubQ requires 64.5x less compute than dense attention and runs 56x faster than FlashAttention-2 on a single attention layer. In practice, this drastically changes the economics of long-context training and inference. A full breakdown of the mechanism and how it compares to FlashAttention, DeepSeek sparse attention, and recurrent architectures is in the Technical Report.
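A back-of-the-envelope sketch shows why the gap widens with context length. It assumes an idealized cost model in which dense attention costs exactly proportional to n² and the sparse pass exactly proportional to n, so their ratio is proportional to n. The function name and the calibration to the single reported point (64.5x at 1M tokens) are our assumptions, and the outputs are extrapolations, not measurements.

```python
REF_TOKENS = 1_000_000  # context length of the reported measurement
REF_RATIO = 64.5        # reported dense-to-SSA compute ratio at REF_TOKENS


def idealized_compute_ratio(n_tokens):
    """Dense/sparse compute ratio under dense ~ n^2 and sparse ~ n.

    Under those assumptions the ratio grows linearly in n, so one
    reported point fixes the whole line. Real ratios also depend on
    constant factors and hardware, which this model ignores.
    """
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return REF_RATIO * n_tokens / REF_TOKENS


for n in (128_000, 1_000_000, 2_000_000, 6_000_000, 12_000_000):
    print(f"{n:>12,} tokens: ~{idealized_compute_ratio(n):.1f}x less compute")
```

Under this model, doubling the context doubles the compute advantage. For measured figures at other context lengths, refer to the Technical Report.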

Third-Party Evaluation

The benchmark results above were independently verified by Appen. Link to full report here.
