DeepSeek's mHC: When Residual Connections Explode
Every transformer you’ve ever used has the same residual connection design from 2016.
GPT-5, Claude, Llama, Gemini. Under the hood, they all do the same thing: $x + F(x)$. One stream of information flowing through the network, with each layer adding to it.
DeepSeek asked: what if that stream were wider?
The Setup
Standard residual connections are the backbone of every modern transformer. The idea is simple:
$$x_{l+1} = x_l + F(x_l)$$
The input flows through unchanged, plus the layer’s output. One stream of information. What goes in comes out, plus a learned update. This is why transformers can be hundreds of layers deep: the gradient has a clean path backward. Simple. Stable. Unchanged since 2016.
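To make that concrete, here is a minimal PyTorch sketch of a residual block (the names are mine, not from any paper); `F` is a stand-in for whatever sublayer the block wraps, typically attention or an MLP:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Standard residual connection: x_{l+1} = x_l + F(x_l)."""

    def __init__(self, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer  # F: any sublayer (attention, MLP, ...)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The identity path carries x through untouched; the sublayer only
        # contributes an additive update, so gradients always have a clean
        # route back through the sum.
        return x + self.sublayer(x)


# e.g. block = ResidualBlock(nn.Linear(512, 512))
#      y = block(torch.randn(8, 512))
```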
Hyper-Connections take a different approach. Instead of one stream, expand to n parallel streams with learnable mixing matrices:
$$x_{l+1} = H^{res}_l x_l + H^{post,T}_l F(H^{pre}_l x_l, W_l)$$
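Below is a hedged sketch of that update. It makes a few simplifying assumptions: the streams live in a tensor of shape `(..., n_streams, d)`, the mixing matrices $H^{res}_l$ (n×n), $H^{pre}_l$ (1×n), and $H^{post}_l$ (1×n) are plain static learnable parameters, and the shapes and initialisation are illustrative rather than the authors' exact parameterisation:

```python
import torch
import torch.nn as nn

class HyperConnection(nn.Module):
    """Sketch of x_{l+1} = H_res x_l + H_post^T F(H_pre x_l, W_l)
    where x_l holds n parallel residual streams, each of width d.
    """

    def __init__(self, sublayer: nn.Module, n_streams: int):
        super().__init__()
        self.sublayer = sublayer  # F(., W_l): attention or MLP
        # H_res: n x n mixing of the residual streams; identity init so
        # training starts from standard single-stream residual behaviour
        # (an assumption here, not the paper's prescribed init).
        self.H_res = nn.Parameter(torch.eye(n_streams))
        # H_pre: 1 x n, collapses the n streams into one layer input.
        self.H_pre = nn.Parameter(torch.ones(1, n_streams) / n_streams)
        # H_post: 1 x n, scatters the layer output back onto the streams.
        self.H_post = nn.Parameter(torch.ones(1, n_streams))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., n_streams, d)
        layer_in = self.H_pre @ x            # (..., 1, d)
        layer_out = self.sublayer(layer_in)  # (..., 1, d)
        return self.H_res @ x + self.H_post.transpose(-1, -2) @ layer_out


# e.g. hc = HyperConnection(nn.Linear(512, 512), n_streams=4)
#      y = hc(torch.randn(2, 16, 4, 512))   # (batch, seq, streams, d)
```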