
Starting from scratch: Training a 30M Topological Transformer



Tauformer is a topological transformer (see paper) that replaces dot‑product attention with a Laplacian-derived scalar (taumode) per token/head, then attends using distances in that scalar space. What follows is an overview of the idea and the first training signals from a 30M-parameter run.

Tauformer in one idea

Tauformer’s goal is to inject domain structure directly into attention by using a Graph Laplacian built from a domain embedding space (a “domain memory”) as a persistent reference. Instead of ranking keys by \(Q\cdot K\), Tauformer ranks them by how similar their Laplacian-derived taumode scalars are, which is intended to bias attention toward domain-relevant relations rather than generic geometric similarity.
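The paper specifies the actual domain-memory graph construction; purely as an illustration, here is one plausible way to derive a feature-space Laplacian \(L = D - A\) in PyTorch, assuming a cosine-affinity graph over the feature dimensions of a domain embedding matrix (the function name and graph construction below are assumptions, not the paper's):

```python
import torch

def feature_space_laplacian(domain_emb: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch: derive a (d, d) feature-space Laplacian from a
    'domain memory' matrix of shape (n_concepts, d); the paper's actual
    construction may differ. Uses absolute cosine affinity between features."""
    cols = torch.nn.functional.normalize(domain_emb, dim=0)  # unit-norm feature columns
    adj = (cols.T @ cols).abs()        # (d, d) feature-feature affinity
    adj.fill_diagonal_(0)              # no self-edges
    deg = torch.diag(adj.sum(-1))      # degree matrix D
    return deg - adj                   # combinatorial Laplacian L = D - A
```

Because \(A\) is symmetric and nonnegative, \(L\) is positive semidefinite, which is what makes the Rayleigh-quotient energy below nonnegative.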

At the implementation level, Tauformer keeps the familiar Q/K/V projections, RoPE, causal masking, and stable softmax/value aggregation pipeline, but changes how attention logits are computed. Each head vector is compressed into a scalar \(\lambda\) using a bounded Rayleigh-quotient energy computed with a feature-space Laplacian \(L\), then logits are computed as a negative distance \(-|\lambda_q-\lambda_k|/\text{temperature}\).

Key building blocks (as implemented):

Taumode scalar: compute \(E_{\text{raw}}=(x^\top L x)/(x^\top x+\varepsilon)\), then bound it as \(E_{\text{raw}}/(E_{\text{raw}}+\tau)\) to produce \(\lambda\in[0,1)\).

Logits: \(\text{att}_{ij} = -|\lambda^Q_i - \lambda^K_j|/\text{temperature}\) (an absolute difference, since each \(\lambda\) is a scalar), then the standard pipeline: causal mask → subtract row max → softmax → multiply by \(V\).
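Putting the two blocks together, here is a minimal PyTorch sketch of the scoring path (RoPE omitted; tensor shapes, defaults such as `tau` and `temperature`, and function names are illustrative assumptions, not the reference implementation):

```python
import torch

def taumode(x: torch.Tensor, L: torch.Tensor,
            tau: float = 1.0, eps: float = 1e-6) -> torch.Tensor:
    """Bounded Rayleigh-quotient energy per head vector.
    x: (..., seq, head_dim), L: (head_dim, head_dim) -> (..., seq)."""
    e_raw = torch.einsum('...sd,de,...se->...s', x, L, x) / ((x * x).sum(-1) + eps)
    return e_raw / (e_raw + tau)                 # lambda in [0, 1)

def taumode_attention(q, k, v, L, temperature=0.1, tau=1.0):
    """Distance-based attention sketch. q, k, v: (batch, heads, seq, head_dim)."""
    lam_q = taumode(q, L, tau)                   # (batch, heads, seq)
    lam_k = taumode(k, L, tau)
    # Logits: negative absolute distance, so closer scalars get larger logits.
    att = -(lam_q.unsqueeze(-1) - lam_k.unsqueeze(-2)).abs() / temperature
    # Standard pipeline: causal mask -> subtract row max -> softmax -> @ V.
    s = q.shape[-2]
    causal = torch.ones(s, s, dtype=torch.bool, device=q.device).tril()
    att = att.masked_fill(~causal, float('-inf'))
    att = att - att.amax(-1, keepdim=True)       # numerical stability
    return att.softmax(-1) @ v
```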

Why it can be cheaper

Because scoring no longer needs full key vectors, Tauformer's KV-cache can store values plus a compact key-side scalar stream rather than both the K and V tensors. Concretely, the cache payload is \((V,\lambda_k)\) instead of \((K,V)\), which cuts the per-layer cache roughly in half for typical head dimensions, at the small overhead of one scalar per token per head.
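To make the ~50% figure concrete, a back-of-envelope comparison (fp16, illustrative dimensions): a standard cache stores \(K\) and \(V\), i.e. \(2\times\) head_dim values per token per head, while the \((V,\lambda_k)\) payload stores head_dim + 1:

```python
def kv_cache_bytes(seq, layers, heads, head_dim, bytes_per_elem=2):
    """Back-of-envelope cache sizes (fp16 = 2 bytes per element)."""
    standard = seq * layers * heads * (2 * head_dim) * bytes_per_elem   # K and V
    tauformer = seq * layers * heads * (head_dim + 1) * bytes_per_elem  # V + lambda_k
    return standard, tauformer

std, tau = kv_cache_bytes(seq=4096, layers=12, heads=12, head_dim=64)
print(std / 2**20, tau / 2**20, tau / std)  # 144.0 MiB, 73.125 MiB, ratio ~0.51
```

With head_dim = 64 the ratio is \(65/128 \approx 0.51\), i.e. just over half the standard cache, and it approaches exactly 50% as head_dim grows.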
