Most testing strategies miss rare edge cases until customers find them in production. Our system automatically generates targeted unit tests for rare bugs, including a test that would have caught Anthropic's recent approximate top-K bug. In this blog post, we'll provide a brief overview of how it works.
Figure 1: Unit-level PBTs are fast but miss edge cases. Proofs offer exhaustive coverage but require extensive reasoning and code refactoring. End-to-end PBTs offer broad coverage but are not compute efficient. Fractional proofs sit at the intersection, using proof decomposition to generate targeted unit tests that balance compute efficiency, accuracy, and developer speed.
Catching the rare bug in top-K sampling
A bug in the TPU implementation of approximate top-K resulted in the most likely token sometimes being excluded from sampling. Rare bugs like this frequently slip through to production because covering every behavior with tests is infeasible in practice. After discovering the bug, Anthropic provided a simple reproducer, but it is the sort of test you only manage to write after a laborious bug-minimization process.
We used fractional proof decomposition to automatically generate the unit test without relying on Anthropic's bug reproducer code. You can run the unit test on Colab. When testing is done via fractional proof decomposition, bugs like this can be found systematically for any code, without the benefit of hindsight.
```python
from hypothesis import given, strategies as st
import jax.numpy as jnp
from jax import lax

# Illustrative bounds and input strategy; the Colab notebook defines the actual ones.
MIN_TOP_K, TOP_K_RANGE = 1, 64
arr_strategy = st.lists(st.floats(-1e6, 1e6, allow_nan=False),
                        min_size=MIN_TOP_K + 1, max_size=512).map(jnp.asarray)

@given(k=st.integers(min_value=0, max_value=TOP_K_RANGE), arr=arr_strategy)
def test_approx_max_k(k, arr):
    N = len(arr)
    k = int(k % min(N - MIN_TOP_K, TOP_K_RANGE)) + MIN_TOP_K
    approx_values, _ = lax.approx_max_k(arr, k=k)
    assert jnp.max(approx_values) == jnp.max(arr), \
        "approx_max_k dropped the true maximum"
```

Figure 2: Top-K sampling should always have some chance of picking the most likely token. We encode this property as a PBT (property-based test): max(approximate_top_k(arr, k=k)) == max(arr). If the implementation of lax.approx_max_k is correct, we should expect the test to pass, because the approximate top-K algorithm divides the data points into L bins and computes the true max in each bin, with L chosen from the desired average recall r (to first order, L ≈ (k - 1) / (2(1 - r))).
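For intuition about why that property must hold, here is a minimal NumPy sketch of the bin-based scheme described in the caption (our illustration, not the actual TPU kernel; the function name and the padding choice are ours):

```python
import numpy as np

def approx_max_k_sketch(arr, k, num_bins):
    """Illustrative bin-based approximate top-k (assumes num_bins >= k)."""
    arr = np.asarray(arr, dtype=float)
    pad = (-len(arr)) % num_bins  # pad with -inf so the bins divide evenly
    padded = np.concatenate([arr, np.full(pad, -np.inf)])
    bin_maxes = padded.reshape(num_bins, -1).max(axis=1)  # exact max per bin
    return np.sort(bin_maxes)[::-1][:k]  # top-k of the bin maxima
```

The true global maximum is always the max of whichever bin it lands in, so a correct implementation can never drop it; that is exactly the invariant the PBT above checks.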
Systematically generating tests via fractional proof decomposition
Figure 3: We encode theorems as property-based tests (PBTs), recursively decompose them into smaller sub-theorems using reasoning, and fuzz the theorems (i.e., run the PBTs) once the decomposition yields compute efficient unit-level PBTs.
Step 1: Identify the theorem, which is the property that your implementation must satisfy. Then, encode it as a PBT using the Hypothesis framework. We call the top-level theorem an end-to-end PBT because it corresponds to the end-to-end behavior of the function.
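As a toy illustration (ours, not from the post), here is an end-to-end PBT for a hypothetical sorting function my_sort, encoding the theorem "the output is an ordered permutation of the input":

```python
from collections import Counter
from hypothesis import given, strategies as st

def my_sort(xs):
    # Stand-in implementation under test.
    return sorted(xs)

@given(st.lists(st.integers()))
def test_sort_end_to_end(xs):
    out = my_sort(xs)
    # Theorem, part 1: the output is ordered.
    assert all(a <= b for a, b in zip(out, out[1:]))
    # Theorem, part 2: the output is a permutation of the input.
    assert Counter(out) == Counter(xs)
```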
You can think of a PBT's execution trace as a fractional version of a brute-force proof: a proof is to a theorem what the PBT execution is to the PBT code. Why combine proofs and PBTs? Standard PBTs are compute efficient but prone to gaps: developers miss edge cases, and rare bugs slip into production. Proofs close these gaps with modular steps that decompose the claim and incorporate reasoning, revealing which cases we are missing, but full proofs are computationally impractical.
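To make the contrast concrete, here is a hypothetical decomposition of the end-to-end sorting theorem above (again our sketch, not the system's output): if my_sort were a merge sort, reasoning about its structure yields sub-theorems about the merge step, each of which becomes a cheap unit-level PBT.

```python
from hypothesis import given, strategies as st

def merge(left, right):
    # Merge step of a hypothetical merge sort (the unit under test).
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

sorted_lists = st.lists(st.integers()).map(sorted)

@given(sorted_lists, sorted_lists)
def test_merge_is_ordered(left, right):
    # Sub-theorem: merging two sorted lists yields a sorted list.
    out = merge(left, right)
    assert all(a <= b for a, b in zip(out, out[1:]))

@given(sorted_lists, sorted_lists)
def test_merge_preserves_elements(left, right):
    # Sub-theorem: the merge neither drops nor invents elements.
    assert sorted(merge(left, right)) == sorted(left + right)
```

Each sub-theorem exercises one step on small inputs, so the fuzzer can cover its edge cases far more cheaply than the end-to-end test can.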