
A polynomial autoencoder beats PCA on transformer embeddings

Why This Matters

This article introduces a polynomial autoencoder that improves on PCA-based embedding compression by capturing nonlinear structure that a linear projection cannot, recovering most of the retrieval quality lost to compression at the same memory budget. The approach is closed-form and simple to integrate into existing systems, which makes it relevant for compressing neural-network embeddings in production.


The most direct way to compress an embedding (other than quantization) is to fit PCA on the corpus and keep the top-d eigenvectors. It works, but PCA is a linear projection, and neural-network embeddings on the sphere are structurally nonlinear — the well-known cone effect in transformers. Some of the variance lives in a nonlinear tail that a linear decoder can’t reach.
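For concreteness, here is what that baseline looks like as a minimal numpy sketch. The function names (fit_pca, encode, decode_linear) are mine, not from the post's repo:

```python
import numpy as np

def fit_pca(X, d):
    """X: (n_docs, D) corpus embeddings. Keep the top-d principal directions."""
    mu = X.mean(axis=0)
    Xc = X - mu
    # Right singular vectors of the centered corpus = eigenvectors of its covariance.
    # For a very large corpus you would eigendecompose the D x D covariance instead.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:d].T                  # (D, d) projection onto the top-d eigenvectors
    return mu, W

def encode(X, mu, W):
    return (X - mu) @ W           # (n, d) compressed codes

def decode_linear(Z, mu, W):
    return Z @ W.T + mu           # plain PCA reconstruction (the linear decoder)
```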

This post is about a closed-form way to add a quadratic decoder on top of PCA, to capture part of that nonlinear tail. The encoder stays plain PCA. The decoder is a degree-2 polynomial lift plus ridge regression (ordinary least squares with L2 regularization), also closed-form. No SGD, no epochs, no hyperparameter search. One np.linalg.solve over corpus statistics.
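A minimal sketch of that decoder fit, under my own naming and with a placeholder regularization strength lam (the repo may organize this differently, e.g. by accumulating the Gram statistics in chunks):

```python
def quadratic_lift(Z):
    """Degree-2 polynomial features of the codes: [z, upper triangle of z z^T]."""
    n, d = Z.shape
    iu = np.triu_indices(d)
    quad = (Z[:, :, None] * Z[:, None, :])[:, iu[0], iu[1]]   # (n, d*(d+1)/2)
    return np.hstack([Z, quad])

def fit_quadratic_decoder(Z, X, mu, lam=1e-3):
    """Closed-form ridge fit of the lifted codes onto the original embeddings."""
    P = quadratic_lift(Z)                                      # (n, p) lifted codes
    A = P.T @ P + lam * np.eye(P.shape[1])
    B = np.linalg.solve(A, P.T @ (X - mu))                     # (p, D) decoder weights
    return B

def decode_quadratic(Z, B, mu):
    return quadratic_lift(Z) @ B + mu                          # quadratic reconstruction
```

The only knob here is lam, and the value above is a placeholder of mine, not a number from the post.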

The construction itself isn’t mine. “PCA encoder + quadratic decoder + least-squares fit” appears in the dynamical-systems literature under the name quadratic manifold (see Jain 2017, Geelen-Willcox 2022/2023, Schwerdtner-Peherstorfer 2024+ — more in §9). I only stumbled onto these papers after running my experiments and believing the construction was new. The polynomial lift doesn’t come up much in modern ML conversations, and this post is a note about a useful trick from an adjacent discipline carrying over to retrieval.

The concrete result. BEIR/FiQA, mxbai-embed-large-v1 (1024d), per-vector budget 512 bytes. Metric is NDCG@10 (Normalized Discounted Cumulative Gain over the top-10, the standard retrieval ranking-quality measure; range [0, 1], higher is better):

| method | NDCG@10 | Δ vs raw 1024d |
|---|---|---|
| raw 1024d (4096 bytes) | 0.4525 | — |
| PCA top-256 | 0.4168 | -3.58 p.p. |
| poly-AE 256d | 0.4441 | -0.85 p.p. |
| matryoshka top-256 | 0.4039 | -4.86 p.p. |

PCA already gives 4× per-vector memory compression at -3.58 p.p. NDCG. A quadratic decoder on top of PCA pulls another +2.73 p.p. — closing almost the entire gap to raw, at the same byte budget. Matryoshka is in the table as another familiar baseline (here it drops more than PCA — a known side observation, not the central claim of this post; see §3 footnote and §4).
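For readers who want to sanity-check the metric, here is a minimal sketch of one common NDCG@10 formulation (exponential gain, log2 discount); the official BEIR evaluator may differ in details such as the gain function:

```python
import numpy as np

def ndcg_at_10(retrieved_rels, all_rels):
    """retrieved_rels: graded relevance of the returned docs, in rank order.
    all_rels: graded relevance of every judged doc for the query."""
    k = 10
    gains = 2.0 ** np.asarray(retrieved_rels[:k], dtype=float) - 1.0
    dcg = float(np.sum(gains / np.log2(np.arange(2, gains.size + 2))))
    # Ideal DCG: the same sum over the best possible ordering of judged docs.
    ideal = np.sort(np.asarray(all_rels, dtype=float))[::-1][:k]
    idcg = float(np.sum((2.0 ** ideal - 1.0) / np.log2(np.arange(2, ideal.size + 2))))
    return dcg / idcg if idcg > 0 else 0.0
```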

Measured on four models (nomic-v1.5, mxbai-large, bge-base, e5-base), poly-AE improves over PCA by +1 to +4.4 p.p. at d=128 and by +0.03 to +2.7 p.p. at d=256. Full table in §3.

The full implementation is ~150 lines of numpy, MIT-licensed, at github.com/IvanPleshkov/poly-autoencoder. The BEIR eval script is beir_eval.py in the same repo. It reproduces in 30–40 minutes on an M-series MacBook (10–15 min to encode the corpus, ~15 min for the ridge solve at d=256).

Contents:
