The most direct way to compress an embedding (other than quantization) is to fit PCA on the corpus and keep the top-d eigenvectors. It works, but PCA is a linear projection, and neural-network embeddings on the sphere are structurally nonlinear — the well-known cone effect in transformers. Some of the variance lives in a nonlinear tail that a linear decoder can’t reach.
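For concreteness, here is a minimal numpy sketch of that PCA encoder (my illustration, not the repo's code; `fit_pca` and `encode` are hypothetical names):

```python
import numpy as np

def fit_pca(X, d):
    # X: (n, D) corpus embeddings. Keep the top-d principal directions.
    mu = X.mean(axis=0)
    # Right singular vectors of the centered corpus = eigenvectors of the
    # covariance matrix, sorted by explained variance.
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:d].T          # mean (D,), projection (D, d)

def encode(X, mu, V):
    return (X - mu) @ V          # (n, d) codes
```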
This post is about a closed-form way to add a quadratic decoder on top of PCA, to capture part of that nonlinear tail. The encoder stays as plain PCA. The decoder is a degree-2 polynomial lift plus ridge regression (ordinary least squares with an L2 penalty), also closed-form. No SGD, no epochs, no hyperparameter search. One np.linalg.solve over corpus statistics.
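A sketch of that fit, continuing the PCA sketch above (again my illustration, not the repo's code; the bias column, the upper-triangular indexing, and `lam=1e-3` are assumptions):

```python
import numpy as np

def quad_lift(Z):
    # Degree-2 polynomial features of the codes: [1, z, z_i * z_j for i <= j].
    n, d = Z.shape
    iu0, iu1 = np.triu_indices(d)
    quad = Z[:, iu0] * Z[:, iu1]                # (n, d(d+1)/2) cross terms
    return np.hstack([np.ones((n, 1)), Z, quad])

def fit_quadratic_decoder(Z, X, lam=1e-3):
    # Closed-form ridge: W = (Phi^T Phi + lam*I)^{-1} Phi^T X.
    Phi = quad_lift(Z)
    G = Phi.T @ Phi                             # (p, p) corpus Gram statistic
    G[np.diag_indices_from(G)] += lam           # L2 regularization
    return np.linalg.solve(G, Phi.T @ X)        # the one solve; no SGD

def decode(Z, W):
    # Approximate full-dimensional embeddings from d-dim codes.
    return quad_lift(Z) @ W
```

Note why this is "one solve over corpus statistics": only the Gram matrix Phi^T Phi and the cross term Phi^T X enter the solve, and both can be accumulated over the corpus in batches, so the full lifted matrix never has to sit in memory.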
The construction itself isn’t mine. “PCA encoder + quadratic decoder + least-squares fit” appears in the dynamical-systems literature under the name quadratic manifold (see Jain 2017, Geelen-Willcox 2022/2023, Schwerdtner-Peherstorfer 2024+ — more in §9). I only stumbled onto these papers after running my experiments and believing the construction was new. The polynomial lift doesn’t come up much in modern ML conversations, and this post is a note about a useful trick from an adjacent discipline carrying over to retrieval.
The concrete result. BEIR/FiQA, mxbai-embed-large-v1 (1024d), per-vector budget 512 bytes. Metric is NDCG@10 (Normalized Discounted Cumulative Gain over the top-10, the standard retrieval ranking-quality measure; range [0, 1], higher is better):
| method | NDCG@10 | Δ vs raw 1024d |
|---|---|---|
| raw 1024d (4096 bytes) | 0.4525 | — |
| PCA top-256 | 0.4168 | −3.58 p.p. |
| poly-AE 256d | 0.4441 | −0.85 p.p. |
| matryoshka top-256 | 0.4039 | −4.86 p.p. |
PCA alone already cuts 1024d to 256d (4× fewer dimensions, and at this budget 8× fewer bytes per vector, 4096 → 512) at −3.58 p.p. NDCG. A quadratic decoder on top of PCA pulls back another +2.73 p.p., closing about three-quarters of the gap to raw at the same byte budget. Matryoshka is in the table as another familiar baseline; here it drops more than PCA, a known side observation rather than the central claim of this post (see the §3 footnote and §4).
The effect holds across four models (nomic-v1.5, mxbai-large, bge-base, e5-base): poly-AE gains over PCA of +1.0 to +4.4 p.p. at d=128 and +0.03 to +2.7 p.p. at d=256. Full table in §3.
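For readers who want the metric fully pinned down, here is a bare-bones per-query NDCG@10 (a sketch with linear gain; BEIR's official harness scores via pytrec_eval, and graded setups often use a 2^rel − 1 gain instead):

```python
import numpy as np

def ndcg_at_10(retrieved_rels, all_rels, k=10):
    # retrieved_rels: relevance labels of the top-k retrieved docs, in rank order.
    # all_rels: labels of every judged doc for this query (for the ideal ranking).
    rels = np.asarray(retrieved_rels, dtype=float)[:k]
    disc = 1.0 / np.log2(np.arange(2, len(rels) + 2))     # 1 / log2(rank + 1)
    dcg = (rels * disc).sum()
    ideal = np.sort(np.asarray(all_rels, dtype=float))[::-1][:k]
    idisc = 1.0 / np.log2(np.arange(2, len(ideal) + 2))
    idcg = (ideal * idisc).sum()
    return dcg / idcg if idcg > 0 else 0.0
```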
Full implementation: ~150 lines of numpy, MIT-licensed, at github.com/IvanPleshkov/poly-autoencoder. The BEIR eval script is beir_eval.py in the same repo. Everything reproduces in 30–40 minutes on an M-series MacBook (10–15 min to encode the corpus, ~15 min for the ridge solve at d=256).
Contents: