━━━━━━━━━━━━━━━━━━━━━━━━━━━
Attention Residuals
━━━━━━━━━━━━━━━━━━━━━━━━━━━
Figure: (a) Standard residuals with uniform additive accumulation. (b) Full AttnRes: each layer attends over all previous outputs. (c) Block AttnRes: layers are grouped into blocks, reducing memory from O(Ld) to O(Nd).
This is the official repository for Attention Residuals (AttnRes), a drop-in replacement for standard residual connections in Transformers that enables each layer to selectively aggregate earlier representations via learned, input-dependent attention over depth.
Overview
Standard residual connections accumulate all layer outputs with fixed unit weights. As depth grows, this uniform aggregation dilutes each individual layer's contribution and lets hidden-state magnitudes grow without bound, a well-known issue with PreNorm Transformers.
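The norm-growth effect is easy to see numerically. Below is a toy sketch (not from the paper) where each layer's output is stood in for by an independent random update of comparable scale; with unit-weight accumulation, the hidden-state norm grows roughly like the square root of depth:

```python
import numpy as np

# Toy illustration: unit-weight residual accumulation inflates norms.
# Each "layer" adds an independent update, so the hidden state norm
# grows roughly like sqrt(depth) for independent updates.
rng = np.random.default_rng(0)
d, num_layers = 256, 48
x = rng.standard_normal(d)          # token embedding
norms = [np.linalg.norm(x)]
for _ in range(num_layers):
    x = x + rng.standard_normal(d)  # standard residual: h_{l+1} = h_l + f(h_l)
    norms.append(np.linalg.norm(x))
```

Real layer outputs are correlated rather than independent, but the qualitative trend (monotone norm growth under uniform accumulation) is the motivation here.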
AttnRes replaces this fixed accumulation with softmax attention over the outputs of all preceding layers.
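A minimal sketch of the idea, for a single token position: stack the preceding layer outputs, score each against the current layer's output, and take a softmax-weighted sum over depth. The specific parameterization here (a single learned key projection `w_k`, current output as the query, scaled dot-product scores) is an illustrative assumption, not necessarily the paper's exact scheme:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D score vector.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attn_residual(prev_outputs, query, w_k):
    """Aggregate preceding layer outputs via softmax attention over depth.

    prev_outputs: (L, d) stacked outputs of the L earlier layers (one token)
    query:        (d,)   current layer's output, acting as the query
    w_k:          (d, d) learned key projection (hypothetical parameterization)
    Returns the aggregated hidden state and the depth-attention weights.
    """
    keys = prev_outputs @ w_k                        # (L, d)
    scores = keys @ query / np.sqrt(query.shape[0])  # (L,)
    weights = softmax(scores)                        # input-dependent, sums to 1
    return weights @ prev_outputs, weights

# Usage: 6 preceding layers, hidden size 8.
rng = np.random.default_rng(1)
H = rng.standard_normal((6, 8))
q = rng.standard_normal(8)
Wk = rng.standard_normal((8, 8)) / np.sqrt(8)
out, w = attn_residual(H, q, Wk)
```

Because the weights are a softmax, the aggregate is a convex combination of earlier states, so its magnitude stays bounded by the largest layer output instead of growing with depth.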