
Language Models Pack Billions of Concepts into 12,000 Dimensions


In a recent 3Blue1Brown video series on transformer models, Grant Sanderson posed a fascinating question: How can a relatively modest embedding space of 12,288 dimensions (GPT-3) accommodate millions of distinct real-world concepts?

The answer lies at the intersection of high-dimensional geometry and a remarkable mathematical result known as the Johnson-Lindenstrauss lemma. While exploring this question, I discovered something unexpected that led to an interesting collaboration with Grant and a deeper understanding of vector space geometry.
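The Johnson-Lindenstrauss lemma says, roughly, that any n points can be mapped into a space of only about O(log n / ε²) dimensions while distorting every pairwise distance by at most a factor of 1 ± ε, and a random projection achieves this with high probability. Below is a minimal PyTorch sketch of that effect; the point count, dimensions, and Gaussian projection are illustrative choices of mine, not taken from the article:

import torch
torch.manual_seed(0)
n, d_high, d_low = 1_000, 10_000, 512            # illustrative sizes
X = torch.randn(n, d_high)                        # random points in a high-dimensional space
P = torch.randn(d_high, d_low) / d_low ** 0.5     # random Gaussian projection, scaled to preserve norms
Y = X @ P
idx = torch.randint(0, n, (200, 2))               # sample some point pairs
idx = idx[idx[:, 0] != idx[:, 1]]                 # drop accidental self-pairs
before = (X[idx[:, 0]] - X[idx[:, 1]]).norm(dim=1)
after = (Y[idx[:, 0]] - Y[idx[:, 1]]).norm(dim=1)
print("max relative distance distortion:", (after / before - 1).abs().max().item())

Despite a roughly twenty-fold reduction in dimension, the sampled distances typically shift by only a few percent.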

The key insight begins with a simple observation: while an N-dimensional space can only hold N perfectly orthogonal vectors, relaxing this constraint to allow "quasi-orthogonal" relationships (vectors at angles of, say, 85-95 degrees) increases the space's capacity dramatically; the number of vectors that can coexist at such angles grows exponentially with the dimension. This property is crucial for understanding how language models can efficiently encode semantic meaning in relatively compact embedding spaces.
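One way to get a feel for quasi-orthogonality is to skip optimization entirely and simply sample random unit vectors in a high-dimensional space: their pairwise angles already concentrate tightly around 90 degrees. A quick sketch follows; the dimension matches GPT-3's embedding width, while the vector count is an arbitrary choice of mine:

import torch
torch.manual_seed(0)
d, n = 12_288, 1_000                        # GPT-3-sized embedding width, arbitrary number of vectors
V = torch.randn(n, d)
V = V / V.norm(dim=1, keepdim=True)         # place every vector on the unit sphere
cosines = V @ V.T                           # pairwise cosine similarities
mask = ~torch.eye(n, dtype=torch.bool)      # drop each vector's similarity with itself
angles = torch.rad2deg(torch.acos(cosines[mask].clamp(-1, 1)))
print(f"mean {angles.mean().item():.2f}, min {angles.min().item():.2f}, max {angles.max().item():.2f} degrees")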

In Grant's video, he demonstrated this principle with an experiment attempting to fit 10,000 unit vectors into a 100-dimensional space while maintaining near-orthogonal relationships. The visualization suggested success, showing angles clustered between 89 and 91 degrees. However, when I implemented the code myself, I noticed something interesting about the optimization process.

The original loss function was elegantly simple:

# Sum of |dot product| over every pair of vectors; the ReLU is a no-op here, since abs() is already non-negative
loss = (dot_products.abs()).relu().sum()
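For context, the surrounding experiment boils down to parameterizing the vectors, renormalizing them onto the unit sphere, computing all pairwise dot products, and minimizing the loss above by gradient descent. The sketch below is my reconstruction of that setup rather than the video's actual code; the optimizer, learning rate, step count, and scaled-down sizes are assumptions:

import torch
torch.manual_seed(0)
n, d = 3_000, 100  # scaled down from the 10,000 vectors in the video to keep the pairwise matrix small
vectors = torch.nn.Parameter(torch.randn(n, d))
optimizer = torch.optim.Adam([vectors], lr=0.01)
for step in range(500):
    optimizer.zero_grad()
    unit = vectors / vectors.norm(dim=1, keepdim=True)  # keep every vector on the unit sphere
    dot_products = unit @ unit.T - torch.eye(n)         # zero out each vector's dot product with itself
    loss = (dot_products.abs()).relu().sum()            # the loss from above
    loss.backward()
    optimizer.step()
print("final loss:", loss.item())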

While this loss function appears perfect for an unbounded ℝᴺ space, it encounters two subtle but critical issues when applied to vectors constrained to a high-dimensional unit sphere:

The Gradient Trap: The dot product between two unit vectors is the cosine of the angle between them, and its derivative with respect to that angle has magnitude sin(θ). This creates a perverse incentive structure: when vectors approach the desired 90-degree relationship, the gradient (sin(90°) = 1.0) pushes strongly toward improvement, but when vectors drift far from the goal (near 0° or 180°), the gradient (sin(0°) ≈ 0) vanishes, effectively trapping these badly aligned vectors in their poor configuration (a quick numerical check appears below).

The 99% Solution: The optimizer discovered a statistically favorable but geometrically perverse solution. Each vector ended up properly orthogonal to 9,900 of the other 9,999 vectors while being nearly parallel to the remaining 99. This configuration, while clearly not the intended outcome, actually represented a global minimum for the loss function; it is mathematically similar to taking 100 orthogonal basis vectors and replicating each one roughly 100 times.
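Returning to the gradient trap, the vanishing gradient is easy to check directly by differentiating |cos(theta)| with respect to the angle. This is a toy verification independent of the article's code:

import torch
# d/dtheta |cos(theta)| = -sign(cos(theta)) * sin(theta): the gradient magnitude is
# close to 1 near 90 degrees but vanishes as the angle approaches 0 or 180 degrees.
for degrees in (1.0, 45.0, 89.0, 179.0):
    theta = torch.deg2rad(torch.tensor(degrees)).requires_grad_()
    loss = theta.cos().abs()
    loss.backward()
    print(f"{degrees:5.1f} deg -> gradient magnitude {theta.grad.abs().item():.4f}")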

This stable configuration was particularly insidious because it satisfied 99% of the constraints while being fundamentally different from the desired constellation of evenly spaced, quasi-orthogonal vectors. To address this, I modified the loss function to use an exponential penalty that increases aggressively as dot products grow:
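One possible form of such a penalty is sketched below; the exact expression and the scale factor k are my guesses at the idea, not necessarily the formula the article settled on:

import torch
def exponential_penalty_loss(dot_products: torch.Tensor, k: float = 10.0) -> torch.Tensor:
    # expm1(k * |x|) is zero for an orthogonal pair and grows exponentially as the pair
    # drifts toward parallel, so badly aligned vectors keep receiving large gradients.
    # The scale factor k is an assumption, not a value taken from the article.
    return torch.expm1(k * dot_products.abs()).sum()

With a penalty of this shape, a nearly parallel pair contributes a large gradient rather than a vanishing one, which is exactly what the linear |dot product| loss was missing.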
