
Attention Residuals

Why This Matters

Attention Residuals (AttnRes) introduces a novel approach to residual connections in Transformers by enabling layers to selectively attend over earlier representations through learned, input-dependent attention. This method addresses issues of unbounded growth and contribution dilution in deep models, improving efficiency and performance, especially with the Block AttnRes variant that balances accuracy with memory constraints. Its adoption could lead to more scalable and effective Transformer architectures across various applications.




Figure: (a) Standard residuals with uniform additive accumulation. (b) Full AttnRes: each layer attends over all previous outputs. (c) Block AttnRes: layers are grouped into N blocks, reducing memory from O(Ld) to O(Nd).

This is the official repository for Attention Residuals (AttnRes), a drop-in replacement for standard residual connections in Transformers that enables each layer to selectively aggregate earlier representations via learned, input-dependent attention over depth.

Overview

Standard residual connections accumulate all layer outputs with fixed unit weights. As depth grows, this uniform aggregation dilutes each individual layer's contribution and lets hidden-state magnitudes grow without bound, a well-known issue in PreNorm Transformers.
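To make the growth concrete, here is a minimal NumPy sketch (the sublayer is a normalize-and-project stand-in, not the actual architecture) showing that fixed unit-weight accumulation, h_l = h_{l-1} + f(h_{l-1}), makes the hidden-state norm grow steadily with depth while each new layer's relative contribution shrinks:

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 64, 48  # hidden width, depth

def sublayer(x):
    # Stand-in for a PreNorm Transformer sublayer: normalize the input,
    # then apply a random projection so outputs have roughly unit-scale entries.
    z = (x - x.mean()) / (x.std() + 1e-5)
    return z @ rng.normal(size=(d, d)) / np.sqrt(d)

h = rng.normal(size=d)
norms = []
for _ in range(L):
    h = h + sublayer(h)  # fixed unit-weight residual accumulation
    norms.append(np.linalg.norm(h))

# The residual-stream norm keeps growing, so each individual layer's
# roughly constant-norm output is an ever-smaller fraction of the stream.
print(norms[0], norms[-1])
```

Since the per-layer outputs are roughly orthogonal, the stream norm grows on the order of the square root of depth, which is exactly the dilution AttnRes is designed to counter.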

AttnRes replaces this fixed accumulation with softmax attention over preceding layer outputs:
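The exact update rule is truncated in this excerpt, so the following NumPy sketch only illustrates the stated idea under assumptions: each layer's output forms an input-dependent query, learned key projections score the stack of all earlier outputs, and a softmax over depth replaces the fixed unit weights. The names `Wq`, `Wk`, and the dot-product scoring are illustrative choices, not the paper's formulation.

```python
import numpy as np

rng = np.random.default_rng(1)
d, L = 64, 12  # hidden width, depth

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def sublayer(x, W):
    # Stand-in sublayer: normalize, then project.
    z = (x - x.mean()) / (x.std() + 1e-5)
    return z @ W / np.sqrt(d)

# "Learned" projections, randomly initialized here for illustration.
Wq = rng.normal(size=(d, d)) / np.sqrt(d)
Wk = rng.normal(size=(d, d)) / np.sqrt(d)
Ws = [rng.normal(size=(d, d)) for _ in range(L)]

x = rng.normal(size=d)
outputs = [x]      # depth-wise memory of all layer outputs: O(Ld) storage
h = x
for l in range(L):
    f = sublayer(h, Ws[l])
    outputs.append(f)
    past = np.stack(outputs)          # (l+2, d): embedding + outputs so far
    q = f @ Wq                        # input-dependent query from the new output
    scores = (past @ Wk) @ q / np.sqrt(d)
    w = softmax(scores)               # attention over depth replaces unit weights
    h = w @ past                      # weighted combination instead of plain sum

print(h.shape)  # (64,)
```

Note that storing every layer output costs O(Ld) memory per token, which is what motivates the Block AttnRes variant's O(Nd) grouping described above.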

... continue reading