§0 · Primer: jargon decoder

Eight ideas the rest of the page is built on. Each mini-demo below covers one concept used later. Skip the ones you already know.

§0.1 · Vector
A list of numbers. An arrow in space.

A vector is an ordered list: [0.3, −1.2]. Geometrically it is an arrow from the origin. A $d$-dimensional vector is an arrow in $d$-space, hard to picture past 3-D, but the rules are the same.

[Demo: drag the arrow's tip; readouts show its coordinates and length.]

§0.2 · Length ‖x‖ & Inner Product ⟨x,y⟩
How much one vector points along another.

Length: $\sqrt{x_1^2+x_2^2+\dots}$. Inner product: $\langle x,y\rangle = x_1 y_1 + x_2 y_2 + \dots = \|x\|\|y\|\cos\theta$. The inner product is largest when the two arrows point in the same direction, zero when they are perpendicular, and most negative when they point exactly opposite.

[Demo: drag either tip; readouts show ‖x‖, ‖y‖, ⟨x,y⟩, and the angle.]

§0.3 · Mean Squared Error
Why we square the mistake.

Error is the distance between a guess and the truth. Scoring a guess by the signed error lets positive and negative errors cancel, so the score fails to penalise being off. Squaring forces every error to count as a positive number and gives big errors a larger penalty than small ones. The guess that minimises the mean of squared errors is the data's average: it is the unique number that minimises the sum of squared distances to the points.

The first moment of a quantity $X$ is its mean $\mathbb{E}[X]$; the second moment is the mean of its square, $\mathbb{E}[X^2]$. A zero-mean variable has a vanishing first moment because positive and negative deviations cancel. Its second moment is strictly positive whenever any deviation is nonzero, because squared values are nonnegative and cannot cancel. The MSE above is itself a second moment, of the residual error. This distinction returns in §7, where the per-input gap $\tilde y - y$ averages to zero in the first moment, while its square averages to a strictly positive quantity in the second.

The average has a property we will use in §7. It lies between the data's most extreme points, so its magnitude is smaller than at least one of them. When a quantizer compresses a whole bin of values down to the bin's average, the stored value is smaller in magnitude than the bin's largest values: the reconstruction is a shrunken version of the input. An inner product against a shrunken reconstruction comes out smaller than the same inner product against the input.

[Demo: move the guess; readouts compare the MSE at the guess with the MSE at the data's mean.]

§0.4 · Unbiased vs Biased Estimator
Noisy is fine. Systematically off is not.

An estimator is a procedure that takes data and returns a guess $\hat\theta$ for an unknown truth $\theta$. Repeat it on fresh data and the guesses form a cloud. The cloud can fail in two independent ways. Variance is one: individual guesses are noisy. Bias is the other: the procedure is wrong even after averaging many guesses. An estimator with $\mathbb{E}[\hat\theta]=\theta$ is unbiased; the cloud's centre sits at $\theta$ regardless of the cloud's width.

The bullseye below shows both failure modes. Bias is the distance from the cloud's centre to the crosshair. Variance is the width of the cloud. The two can vary independently. §7 runs the same bullseye against the MSE quantizer of §6, and the cloud's centre lands away from the crosshair.
§8 runs it against a different estimator whose cloud centres on the crosshair.

[Demo: choose "unbiased, noisy" or "biased, systematically off" and fire shots; readouts show the mean of the shots and the empirical bias.]

§0.5 · Rotation
A rigid spin. Preserves lengths and angles.

A rotation matrix $R$ spins space. The key property: $\|Rx\|=\|x\|$ and $\langle Rx,Ry\rangle=\langle x,y\rangle$. Rotation only changes the basis the coordinates are written in, not the geometry.

[Demo: drag the tip and the angle θ; readouts confirm ‖x‖ is unchanged before and after.]

§0.6 · Where bell-curves come from (CLT)
Add up many small randoms → Gaussian.

The Central Limit Theorem says that summing enough independent random numbers (with finite variance) produces a distribution close to a bell curve. The shape of the individual terms does not affect the limit: a sum of coin flips converges to the same Gaussian shape as a sum of uniform draws or a sum of skewed draws. A rotated coordinate is one of these sums: it is a weighted combination of every coordinate of the original vector, with random weights. After a random rotation, each new coordinate is therefore approximately Gaussian, which is the property TurboQuant relies on for every input.

[Demo: pick a source distribution (±1 coin, uniform on [−1,1], skewed exponential) and the number of terms n; the histogram converges to the same bell curve regardless of source.]

§0.7 · Life in many dimensions
Coordinates of a random unit vector are all small.

Pick a random point on a unit sphere in $d$ dimensions. In 2-D any coordinate value is possible. In 100-D, almost every coordinate has magnitude on the order of $1/\sqrt{d}$. This is measure concentration, and it is the core fact TurboQuant exploits.

[Demo: slide d upward; the standard deviation of $x_1$ tracks $1/\sqrt d$.]

§0.8 · Quantization, in one dimension
Snap every number to the nearest of $2^b$ levels.

This is what $b$ bits per number means. With $b=2$ you get 4 levels, $b=3$ gives 8. The gap between levels is your worst-case error. Adding one bit halves the gap, so the squared error drops by roughly 4× per bit, the $4^{-b}$ factor that shows up later. (A code sketch follows the cheat sheet.)

[Demo: slide b; readouts show the number of levels, the gap Δ, and the max error. At b=2: 4 levels, Δ ≈ 0.667, max error ≈ 0.333.]

■ CHEAT SHEET · Eight ideas, one sentence each

Vector: ordered list of numbers / arrow from the origin.
Length & inner product: the norm $\sqrt{\sum x_i^2}$ and how much two vectors point the same way.
MSE: average squared error.
Unbiased: the average of many estimates equals the truth.
Rotation: change of basis that preserves lengths and angles.
CLT: a sum of many independent randoms converges to a Gaussian.
High-D concentration: coordinates of a random unit vector in $d$-space have typical size $1/\sqrt d$.
Quantization: snap each number to one of $2^b$ levels; one extra bit quarters the squared error.
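To make §0.8 concrete, here is a minimal numpy sketch, not the page's demo code: `uniform_quantize` is an illustrative name, and the $[-1,1]$ range and uniform test data are assumptions. It snaps samples to $2^b$ evenly spaced levels and shows the squared error dropping by roughly 4× per extra bit.

```python
import numpy as np

def uniform_quantize(x, b, lo=-1.0, hi=1.0):
    """Snap each value to the nearest of 2**b evenly spaced levels in [lo, hi]."""
    levels = np.linspace(lo, hi, 2**b)
    idx = np.abs(x[:, None] - levels[None, :]).argmin(axis=1)
    return levels[idx]

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100_000)
for b in (2, 3, 4):
    mse = np.mean((x - uniform_quantize(x, b))**2)
    print(f"b={b}: gap={2 / (2**b - 1):.3f}  MSE={mse:.5f}")
# Each extra bit roughly halves the gap and roughly quarters the MSE.
```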
§1 · Vector quantization
What is vector quantization, really?

You have a vector $\mathbf{x}\in\mathbb{R}^d$, say $d{=}1536$ floats from an OpenAI embedding. You want to store it using $b$ bits per coordinate (total $b\cdot d$ bits), then later recover an approximation $\tilde{\mathbf{x}}$ close to $\mathbf{x}$. Closeness is measured by MSE distortion

$D_{\text{mse}} = \mathbb{E}\big[\,\|\mathbf{x} - \tilde{\mathbf{x}}\|_2^2\,\big]$

or inner-product error

$D_{\text{prod}} = \mathbb{E}\big[\,|\langle\mathbf{y},\mathbf{x}\rangle - \langle\mathbf{y},\tilde{\mathbf{x}}\rangle|^2\,\big]$

The second one matters because attention scores and nearest-neighbour queries are all inner products. We would also like the estimator to be unbiased: $\mathbb{E}[\langle\mathbf{y},\tilde{\mathbf{x}}\rangle] = \langle\mathbf{y},\mathbf{x}\rangle$.

■ KEY WORDS

MSE distortion: average squared error between the true vector and its reconstruction (primer §0.3).
Inner product $\langle y, x\rangle$: how much two vectors point the same way (primer §0.2). This is what attention computes.
Estimator: a rule (here: quantize, then decode) that returns an approximation $\hat s$ of a true number $s$.
Unbiased estimator: across many queries, the average of $\hat s$ equals $s$. Individual estimates can be noisy; the mean is on target (primer §0.4).

The obvious quantizer

For each coordinate, pick the closest of $2^b$ evenly spaced levels in $[-1, 1]$. That is $b$ bits per number. The same rule runs in 2-D and 3-D first, where the geometry is visible, before the high-dimensional version below (a code sketch of it follows this section).

First, in 2D

Drag the tip of the vector. It snaps to the nearest point of a $2^b \times 2^b$ grid. The green arrow is the original input, the blue arrow is its quantized position, and the red segment between them is the reconstruction error $\mathbf{x} - \tilde{\mathbf{x}}$.

[Demo: slide b; presets: spike (0.95, 0.05), balanced (0.7, 0.7), small (0.3, 0.2), or drag freely; readouts show ‖error‖/‖x‖, levels per axis, and grid points.]

Same trick in 3D

A $2^b$-level grid on three axes gives $2^{3b}$ snap points. Drag the canvas to orbit the view. The spike preset shows where the construction breaks: the input lies near one axis and falls between two grid levels, which is where the reconstruction error is largest.

[Demo: slide b; presets: spike (0.95, 0.15, 0.05), balanced $(1,1,1)/\sqrt 3$, outlier (0.9, 0.4, 0.1), diagonal (0.7, 0.7, 0.1), or drag; readouts show ‖error‖/‖x‖, levels per axis, and grid points.]

Now at scale (d up to 128)

The same rule applied to every coordinate of a high-dimensional vector. You cannot see the grid anymore, but the per-coordinate errors are still there.

[Demo: slide b and d; inputs: random unit vector (uniform), adversarial one-spike, random Gaussian, sparse (few large coords); readouts show ‖x − x̃‖²/‖x‖², levels per coordinate, and bits used.]

Select the spike input. The naive quantizer's grid is spaced evenly over $[-1, 1]$. The input has almost all of its magnitude in a single coordinate, whose value falls between the two grid levels nearest to it and so reconstructs poorly. The remaining coordinates are near zero and carry little of the input's information, yet most of the grid's levels are spent covering ranges they never occupy.

■ TAKEAWAY · NEXT §2 · where the gap shows up

A fixed grid produces small reconstruction errors on inputs whose coordinates are roughly uniform in magnitude, and large reconstruction errors on inputs whose magnitude is concentrated in one or a few coordinates.
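A hedged sketch of the naive quantizer at scale, under the same fixed $[-1,1]$ grid; `naive_quantize`, `rel_sq_err`, and the two test vectors are illustrative choices, not the demo's code. At an equal bit budget the spike input reconstructs several times worse than the spread-out one:

```python
import numpy as np

def naive_quantize(x, b):
    """Per coordinate: snap to the nearest of 2**b evenly spaced levels in [-1, 1]."""
    levels = np.linspace(-1.0, 1.0, 2**b)
    return levels[np.abs(x[:, None] - levels[None, :]).argmin(axis=1)]

def rel_sq_err(x, xt):
    return np.sum((x - xt)**2) / np.sum(x**2)

rng = np.random.default_rng(0)
d, b = 64, 3
flat = rng.standard_normal(d)
flat /= np.linalg.norm(flat)      # magnitude spread across all coordinates
spike = np.zeros(d)
spike[0] = 0.95                   # magnitude concentrated in one coordinate

print("flat :", rel_sq_err(flat,  naive_quantize(flat,  b)))
print("spike:", rel_sq_err(spike, naive_quantize(spike, b)))
# The spike's many near-zero coordinates each snap to the nearest nonzero level,
# so its relative error is several times the flat vector's.
```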
Next: §2 shows how production systems handle the second case and what they pay for the fix.
§2 · Why naive fails
The adversarial coordinate, and why production systems pay a tax

Real embeddings are rarely flat. Trained models produce outlier channels: a few coordinates much larger than the rest. A fixed $[-L, L]$ grid either clips the outliers or wastes resolution on the bulk. Production quantizers (GPTQ, AWQ, KIVI, KVQuant) work around this by computing $(\min, \max)$ (or zero-point and scale) for every small block and storing those in full precision as side information.

The catch. To decode any block you also need its scale and zero-point: two float16 numbers, 32 extra bits, stored next to every 16–64 quantized values. Walk through one case: a block of 32 numbers at 3 bits each is 96 payload bits plus 32 metadata bits, which works out to 4 bits per number, not 3. Smaller blocks of 16 numbers push it to 5 bits per number. The advertised 3-bit scheme is really a 4–5-bit scheme once you count everything. TurboQuant matches this worst-case quality while storing zero per-block metadata.

■ DEMO · feel the catch · same b bits/value, three strategies

A 64-dimensional vector whose coordinates are mostly small, with one large outlier shown in red. Three quantizers reconstruct the same vector at the same $b$-bit budget (a code sketch of strategies A and B follows this section):

A. Fixed grid $[-L, L]$: one global range, $b$ bits/value, no header. The outlier clips.
B. Per-block scale + zero: a float16 scale and zero-point per block (dashed dividers). The outlier fits, but the header taxes you.
C. Rotate → fixed grid: rotation smears the spike across all 64 coordinates, so one global grid works, with no header.

[Demo: adjust the outlier magnitude, bit budget b, and block size s; readouts show each strategy's RMSE, effective bits/value, and metadata overhead.]

Read the storage line. The effective bits-per-value is $b + 32/s$ for the per-block scheme and $b$ for the other two, because only the per-block scheme stores a float16 scale and zero-point (32 bits together) for every block of $s$ elements. At $b{=}3$, $s{=}16$ the per-block cost works out to $3 + 2 = 5$ bits/value, a 67% surcharge over the nominal $b$. Strategy C achieves the storage cost of strategy A with the reconstruction quality of strategy B. The rest of this page explains the construction that makes that possible.

■ TAKEAWAY · NEXT §3 · one fixed recipe, any input

Production quantizers handle outliers by paying a per-block metadata tax. TurboQuant must instead be data-oblivious: a single procedure that runs on every vector with no calibration set and no per-block headers. Next: §3 introduces the move that makes a fixed grid work for every input.
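A sketch of strategies A and B under one simplifying assumption: the per-block header is modelled as the block's (min, max) pair rather than a scale and zero-point, which changes nothing about the storage arithmetic. Function names here are illustrative.

```python
import numpy as np

def snap(x, levels):
    """Snap each value to its nearest level."""
    return levels[np.abs(x[:, None] - levels[None, :]).argmin(axis=1)]

def fixed_grid(x, b):                      # strategy A: one global [-1, 1] range
    return snap(x, np.linspace(-1.0, 1.0, 2**b)), b

def per_block(x, b, s):                    # strategy B: per-block range + header
    out = np.empty_like(x)
    for i in range(0, len(x), s):
        blk = x[i:i+s]
        out[i:i+s] = snap(blk, np.linspace(blk.min(), blk.max(), 2**b))
    return out, b + 32 / s                 # two float16 headers per s values

rng = np.random.default_rng(0)
x = 0.1 * rng.standard_normal(64)
x[7] = 4.0                                 # one outlier channel

b, s = 3, 16
for name, (xt, bits) in [("A", fixed_grid(x, b)), ("B", per_block(x, b, s))]:
    rmse = np.sqrt(np.mean((x - xt)**2))
    print(f"{name}: rmse={rmse:.3f}  bits/value={bits:.2f}")
# A clips the outlier (large rmse at 3 bits/value);
# B fits it (small rmse, but 3 + 32/16 = 5 bits/value).
```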
§3 · The rotation trick
Multiply by a random rotation. Watch the spike dissolve.

The rotation trick: apply a random orthogonal transform $\boldsymbol{\Pi}$, then quantize coordinate-wise. Rotation is lossless: it preserves length and inner products exactly.

$\|\boldsymbol{\Pi}\mathbf{x}\|_2 = \|\mathbf{x}\|_2$ · $\langle \boldsymbol{\Pi}\mathbf{x},\,\boldsymbol{\Pi}\mathbf{y}\rangle = \langle\mathbf{x},\mathbf{y}\rangle$ · $\boldsymbol{\Pi}^{\!\top}\boldsymbol{\Pi} = \mathbf{I}$

Because rotation is exact, all reconstruction error comes from the quantization step alone. After a uniformly random rotation, every coordinate of $\boldsymbol{\Pi}\mathbf{x}$ follows the same fixed Beta density (Lemma 1 of the paper), regardless of what $\mathbf{x}$ looked like. A single codebook designed once for that density is then optimal for every input. We build the codebook in §5.

How to construct $\boldsymbol{\Pi}$: generate a $d\times d$ matrix of i.i.d. $\mathcal{N}(0,1)$ entries and run QR decomposition; keep the orthogonal factor $Q$, with its columns' signs fixed so the diagonal of $R$ is positive. The result is uniform on the orthogonal group $O(d)$, which is what Lemma 1 needs. (A code sketch follows this section.)

A spike in 2D

Start with the extreme case: a vector with all of its magnitude in one coordinate, $(1, 0)$. Rotate by angle $\theta$ and watch the magnitude redistribute across the two coordinates. At $\theta{=}45°$ it splits evenly, giving $(\tfrac{1}{\sqrt 2}, \tfrac{1}{\sqrt 2})$; the vector's total length never changes.

[Demo: drag the angle θ or the tip; readouts show the coordinate magnitudes, max |coord|, the preserved length, and the even-split angle 45°.]

A spike in 3D

The same construction in three dimensions. The spike $(1, 0, 0)$ is rotated by a random orthogonal matrix, which spreads the input's magnitude across all three coordinates of the output while preserving its length. Each fresh draw of the random rotation produces a different spread.

[Demo: draw fresh rotations, orbit the view; readouts show coordinate magnitudes, max |coord|, and the preserved length. The typical max coordinate under a random rotation is ≈ 0.8.]

At high dimension

A single rotation in 2-D reduces the largest coordinate to at best $1/\sqrt 2 \approx 0.71$ of the input's length. A random rotation in 3-D typically leaves the largest coordinate around $0.8$. At $d{=}64$ the largest coordinate after rotation is within a small logarithmic factor of $1/\sqrt d \approx 0.125$, regardless of how concentrated the input was.

[Demo: choose an input (spike, 5 large coords, 1 outlier + bulk, already-flat Gaussian) and draw fresh rotations Π; bar charts compare $|x_i|$ before with $|(\boldsymbol{\Pi}\mathbf{x})_i|$ after; readouts show max $|x_i|/\|x\|$, max $|(\boldsymbol{\Pi}\mathbf{x})_i|/\|x\|$, and the preserved length.]

■ TAKEAWAY · NEXT §4 · no spike survives a random rotation

Rotation preserves length and inner products. The only thing it changes is which coordinates contain the magnitude of the vector. A vector with all of its mass concentrated in one coordinate becomes, after rotation, a vector whose mass is spread across all $d$ coordinates. Because every input is rotated before quantization, every input that gets quantized is of this spread-out kind. Next: §4 explains why rotation flattens spikes using the geometry of high-dimensional spheres.
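A minimal numpy sketch of the construction just described; `random_rotation` is an illustrative name, and the sign fix after QR is the standard trick (Mezzadri 2006) that makes $Q$ exactly Haar-uniform rather than approximately so.

```python
import numpy as np

def random_rotation(d, rng):
    """Haar-uniform orthogonal matrix: QR of a Gaussian matrix,
    column signs fixed so diag(R) > 0 (Mezzadri 2006)."""
    A = rng.standard_normal((d, d))
    Q, R = np.linalg.qr(A)
    return Q * np.sign(np.diag(R))      # flip columns where diag(R) < 0

rng = np.random.default_rng(0)
d = 64
P = random_rotation(d, rng)

x = np.zeros(d)
x[0] = 1.0                              # adversarial spike
y = P @ x
print("length preserved :", np.linalg.norm(y))   # 1.0
print("max |coord| before:", np.abs(x).max())    # 1.0
print("max |coord| after :", np.abs(y).max())    # a small multiple of 1/sqrt(d)
```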
§4 · Why rotation works
Coordinates of random unit vectors are nearly Gaussian.

Rotating $\mathbf{x}$ by a uniformly random $\boldsymbol{\Pi}$ is the same as picking a random point on the sphere of radius $\|\mathbf{x}\|$. So the question "what does a coordinate of $\boldsymbol{\Pi}\mathbf{x}$ look like?" is the same as "what does a coordinate of a uniform point on the sphere look like?"

In low dimensions the answer is far from a bell curve. In 2-D the marginal is the arcsine density, U-shaped with peaks at $\pm 1$. In 3-D it is uniform on $[-1, 1]$. As $d$ grows the marginal narrows and converges to a Gaussian with variance $1/d$. The convergence is visible in the demos that follow, and a numerical check follows this section.

The exact density (Lemma 1). For a uniform point on $\mathbb{S}^{d-1}$, the marginal density of any single coordinate is

$f_X(x) \;=\; \dfrac{\Gamma(d/2)}{\sqrt{\pi}\,\Gamma((d-1)/2)}\,(1-x^2)^{(d-3)/2},\quad x\in[-1,1]$,

a scaled and shifted Beta distribution. It converges pointwise to $\mathcal{N}(0,\,1/d)$ as $d\to\infty$.

Step one: the circle ($d=2$). Sample 2000 points uniformly from the unit circle and look at a single coordinate, say $x_1$. The marginal is the arcsine density $\tfrac{1}{\pi\sqrt{1-x^2}}$, U-shaped with peaks at $\pm 1$. The shape is far from Gaussian: any value of $x_1$ between $-1$ and $+1$ is possible, and the endpoints are more likely than the middle.

[Demo: points on the unit circle and the histogram of $x_1$; the std of $x_1$ matches $1/\sqrt d = 0.707$.]

Step two: the sphere ($d=3$). Now sample uniformly from the unit sphere in 3-D. The marginal of one coordinate is uniform on $[-1, 1]$ (Archimedes' hat-box theorem): still not a bell curve.

[Demo: orbit the point cloud; the histogram of $x_1$ is flat; the std of $x_1$ matches $1/\sqrt d = 0.577$.]

Step three: high dimensions. Drag $d$ upward. The marginal narrows and converges to a Gaussian with standard deviation $1/\sqrt d$. By $d{=}30$ the marginal is visually Gaussian. By $d{=}256$ almost all of the mass concentrates in a band of width $\sim 1/\sqrt d$ around zero.

[Demo: slide d and the sample count; compare the empirical histogram, the exact Beta PDF, and the $\mathcal{N}(0, 1/d)$ approximation.]

Distinct coordinates are also approximately independent, a stronger condition than uncorrelated and the one the per-coordinate quantization argument below actually needs.

■ TAKEAWAY · NEXT §5 · one distribution, one codebook

Every coordinate of a rotated vector follows the same known density. The scalar quantization problem for that density can be solved once, and the solution reused for every coordinate of every vector: no per-block scale factors, no side information to store. Next: §5 builds the codebook with Lloyd–Max.
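A quick numerical check, assuming the standard trick of normalising i.i.d. Gaussians to sample the sphere uniformly. Note that std$(x_1) = 1/\sqrt d$ holds exactly at every $d$; only the shape of the marginal turns Gaussian as $d$ grows.

```python
import numpy as np

def sphere_samples(n, d, rng):
    """Uniform points on the unit sphere: normalize i.i.d. Gaussian vectors."""
    g = rng.standard_normal((n, d))
    return g / np.linalg.norm(g, axis=1, keepdims=True)

rng = np.random.default_rng(0)
for d in (2, 3, 32, 256):
    x1 = sphere_samples(50_000, d, rng)[:, 0]
    print(f"d={d:4d}  std(x1)={x1.std():.4f}  1/sqrt(d)={1/np.sqrt(d):.4f}")
# std(x1) matches 1/sqrt(d) at every d (arcsine at d=2, uniform at d=3,
# near-Gaussian at large d); only the shape changes.
```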
§5 · The universal codebook
Lloyd–Max: the optimal partition of a known distribution.

Every rotated coordinate looks like a draw from the same density (§4). So there is one scalar problem to solve, once: pick $2^b$ landing values on the number line such that snapping any sample to its nearest landing value introduces as little error as possible. Those landing values are the codebook. A classical algorithm finds them: Lloyd–Max (Lloyd 1957/82, Max 1960).

Because the density is fixed and known in advance, Lloyd–Max runs once at table-build time. The resulting landing values are saved into a tiny per-$b$ table. Encoding a coordinate after that is a single nearest-neighbour lookup against the table. The same table is used for every input, with no calibration step and no per-vector tuning. Drag $b$ below to watch Lloyd–Max settle on the landing values for the Beta density.

The Lloyd–Max iteration. Given a PDF $f_X$, choose centroids $c_1 \le \dots \le c_{2^b}$ minimising $\int (x - c_{i(x)})^2 f_X(x)\,dx$ by alternating (a runnable sketch follows this section):

Assignment: each centroid owns the Voronoi cell around it; boundaries are midpoints between adjacent centroids.
Update: each centroid moves to the conditional mean of its cell, $c_k \leftarrow \mathbb{E}[X \mid X \in \text{cell}_k]$.
Repeat until stable.

The demo runs this on the Beta density of §4.

[Demo: slide b and d; step the Lloyd iteration or run it to convergence; the plot shows the $\mathcal{N}(0,1/d)$ density, the centroids $c_k$, and the bin boundaries; readouts compare the per-coordinate MSE with the Shannon bound $4^{-b}/d$.]

For moderate $d$, the paper's explicit centroids (after normalising by $\sqrt{d}$) are: $b{=}1\!:\ \pm\sqrt{2/\pi}$; $b{=}2\!:\ \{\pm 0.453,\ \pm 1.510\}$; and so on. Theorem 1 proves the per-coordinate MSE is $\lesssim \tfrac{\sqrt{3}\pi}{2d}\cdot 4^{-b}$. The constant $\tfrac{\sqrt{3}\pi}{2}\approx 2.72$ is the asymptotic ratio to Shannon's minimum of $\tfrac{1}{d}\cdot 4^{-b}$; at $b{=}1$ the paper reports a tighter ratio of $\approx 1.45$.

■ TAKEAWAY · NEXT §6 · a tiny lookup, baked once

Lloyd–Max gives the optimal partition for a known density, so the centroids for the Beta marginal can be precomputed and stored as a tiny per-$b$ table. The per-coordinate MSE the resulting codebook achieves is within a factor of $\approx 2.72$ of Shannon's lower bound asymptotically, and within $\approx 1.45$ at $b{=}1$. Next: §6 assembles rotation and codebook into TurboQuant-MSE.
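A sample-based sketch of the iteration. The paper works with the exact Beta density; here a large standard-Gaussian sample stands in for the marginal at large $d$ (after the $\sqrt d$ normalisation), and `lloyd_max` is an illustrative name. The converged centroids reproduce the table above.

```python
import numpy as np

def lloyd_max(samples, b, iters=200):
    """Lloyd-Max on an empirical density: alternate Voronoi assignment
    (midpoint boundaries) and centroid update (conditional means)."""
    samples = np.sort(samples)
    # initialise centroids at evenly spaced quantiles
    c = np.quantile(samples, (np.arange(2**b) + 0.5) / 2**b)
    for _ in range(iters):
        bounds = (c[:-1] + c[1:]) / 2                   # assignment step
        cells = np.searchsorted(bounds, samples)
        c = np.array([samples[cells == k].mean()        # update step
                      for k in range(2**b)])
    return c

rng = np.random.default_rng(0)
z = rng.standard_normal(200_000)   # stand-in for the Beta marginal at large d
for b in (1, 2):
    print(f"b={b}:", np.round(lloyd_max(z, b), 3))
# b=1 -> ±0.798 ≈ ±sqrt(2/pi); b=2 -> ±0.453, ±1.510, matching the table above.
```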
§6 · TurboQuant-MSE
Putting it together.

STEP 1 · Rotate: $\mathbf{y} = \boldsymbol{\Pi}\mathbf{x}$. The same $\boldsymbol{\Pi}$ is reused for every vector.
STEP 2 · Round each coordinate: for each $j$, $\texttt{idx}_j = \arg\min_k |y_j - c_k|$. Stores $b$ bits.
STEP 3 · Store: $b\cdot d$ bits total. No scales, no zero-points.
STEP 4 · Look up: $\tilde{y}_j = c_{\texttt{idx}_j}$ from the universal codebook.
STEP 5 · Rotate back: $\tilde{\mathbf{x}} = \boldsymbol{\Pi}^{\!\top}\tilde{\mathbf{y}}$. Done.

(A code sketch composing these five steps follows this section.)

[Demo: slide b and d; inputs: adversarial spike, outlier channel, random unit, Gaussian; the plot traces x → Πx (rotated, nearly Gaussian) → quantized Πx (snapped to the codebook) → x̃ = Πᵀ·quant(Πx), plus the error x − x̃; readouts compare ‖x − x̃‖²/‖x‖² with the naive no-rotation error, the Shannon floor $4^{-b}$, and the compression factor.]

Toggle between input types. Naive quantization without rotation fails on the spike input and on the outlier-channel input. With the rotation step in front, the reconstruction error is roughly the same regardless of which input is selected: every rotated coordinate follows the same $\mathcal{N}(0,\,1/d)$ distribution, which is the distribution the codebook was designed for.

■ TAKEAWAY · NEXT §7 · MSE is solved, but…

TurboQuant-MSE stores $b\cdot d$ bits per vector and zero metadata. The reconstructed $\tilde{\mathbf{x}}$ is nearly as close to the original $\mathbf{x}$ as any quantizer can get, within a factor of $\approx 2.72$ of Shannon's information-theoretic lower bound. Next: §7 shows that this same codebook produces a systematically biased estimate of inner products, an error that minimising reconstruction MSE does not address.
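The five steps composed into one sketch, reusing `random_rotation` and `lloyd_max` from the earlier snippets; this is an illustration of the recipe under those assumptions, not the paper's implementation.

```python
import numpy as np
# assumes random_rotation() (§3 sketch) and lloyd_max() (§5 sketch) are defined

def turboquant_mse_encode(x, P, codebook):
    """Rotate, then snap each coordinate to its nearest codebook entry."""
    y = P @ x
    return np.abs(y[:, None] - codebook[None, :]).argmin(axis=1)  # b*d bits total

def turboquant_mse_decode(idx, P, codebook):
    return P.T @ codebook[idx]            # look up, rotate back

rng = np.random.default_rng(0)
d, b = 64, 3
P = random_rotation(d, rng)
# universal codebook: Lloyd-Max centroids for N(0,1), scaled to the N(0,1/d) marginal
codebook = lloyd_max(rng.standard_normal(200_000), b) / np.sqrt(d)

x = np.zeros(d)
x[0] = 1.0                                # adversarial spike, unit norm
idx = turboquant_mse_encode(x, P, codebook)
xt = turboquant_mse_decode(idx, P, codebook)
print("rel. error:", np.sum((x - xt)**2))
# Small: a constant times 4**-b, versus order-1 for the unrotated fixed grid.
```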
§7 · The inner-product bias
MSE-optimal quantizers underestimate inner products.

§6's TurboQuant-MSE keeps $\tilde{\mathbf{x}}$ close to $\mathbf{x}$ in squared distance. Attention does not measure $\|\mathbf{x}-\tilde{\mathbf{x}}\|^2$. It computes $\langle \mathbf{q}, \tilde{\mathbf{k}}\rangle$ and uses that number as a stand-in for $\langle \mathbf{q}, \mathbf{k}\rangle$. The MSE codebook gives a systematically wrong answer to the inner-product question. Each trial returns the same proportional error, so averaging many trials does not remove it.

Two earlier facts produce the shrinkage. In §0.3 the MSE-optimal reconstruction for a set of values was the set's average, and that average had smaller magnitude than the set's extreme values. In §4 a random rotation made every coordinate of $\boldsymbol{\Pi}\mathbf{x}$ behave like a zero-mean draw with most of its mass close to 0. Combine the two and the shrinkage is forced: the encoder partitions each axis into $2^b$ bins and stores only which bin $\boldsymbol{\Pi}\mathbf{x}$ fell into; the decoder reconstructs with the bin's average; and the bin's average sits closer to 0 than the tail inputs that fall into the same bin. The reconstruction $\tilde{\mathbf{x}}$ is therefore a shrunken copy of $\mathbf{x}$, and the inner product $\langle \mathbf{q}, \tilde{\mathbf{k}}\rangle$ comes out smaller than $\langle \mathbf{q}, \mathbf{k}\rangle$. Because the codebook is fixed, the shrinkage factor is identical on every trial.

■ SEE THE SHRINKAGE · drag y, watch ỹ snap

One rotated coordinate $y$ has the near-Gaussian density drawn on top. Lloyd–Max partitions the axis into $2^b$ bins (interior verticals); each bin's centroid is the MSE-optimal reconstruction (red dots). Drag the mint handle to set $y$. The encoder snaps it to the centroid of the bin it fell into, giving $\tilde y$ (red). The staircase underneath plots the map $\tilde y(y)$ across the whole axis at once: the outermost steps stay flat while the dashed identity line keeps rising, so large inputs are pulled toward zero, and averaged over the density this pull is the shrinkage.

Variance budget. The unit variance $\sigma^2 = 1$ splits into the part $\tilde y$ keeps and the part erased inside each bin: the kept fraction is $\lambda_b = \mathbb{E}[\tilde y^2]/\sigma^2$ and the erased in-bin variance is $D_b$. At $b{=}1$, $\lambda_1 = 2/\pi \approx 0.637$ and $D_1 \approx 0.363$. (A numerical check follows.)

[Demo: slide b, drag y; readouts show $y$, $\tilde y$, the instantaneous $(\tilde y - y)^2$, which averages to $D_b$, and $\lambda_b = \mathbb{E}[\tilde y^2]/\sigma^2$.]
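A numerical check of the $b{=}1$ numbers quoted above, assuming a standard-Gaussian coordinate: the 1-bit MSE-optimal reconstruction is $\tilde y = \mathrm{sign}(y)\sqrt{2/\pi}$, and both the kept variance and the inner-product shrinkage come out at $2/\pi \approx 0.637$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
c = np.sqrt(2 / np.pi)            # b=1 Lloyd-Max centroids for N(0,1): ±c

y  = rng.standard_normal(n)       # one rotated coordinate per sample
yt = np.sign(y) * c               # MSE-optimal 1-bit reconstruction

print("lambda_1 =", np.mean(yt**2))           # kept variance, 2/pi ≈ 0.637
print("D_1      =", np.mean((y - yt)**2))     # erased in-bin variance ≈ 0.363
print("E[yt*y]/E[y*y] =", np.mean(yt * y))    # inner-product shrinkage ≈ 0.637
# Inner products against the reconstruction come out about 0.637x too small
# at b=1, on every trial: a bias, not noise.
```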
... continue reading