
Attention Residuals

Why This Matters

Attention Residuals (AttnRes) introduces a novel approach to residual connections in Transformers by enabling layers to selectively attend over earlier representations through learned, input-dependent attention. This method addresses issues of unbounded growth and contribution dilution in deep models, improving efficiency and performance, especially with the Block AttnRes variant that balances accuracy with memory constraints. Its adoption could lead to more scalable and effective Transformer architectures across various applications.




Figure: (a) Standard residuals with uniform additive accumulation. (b) Full AttnRes: each layer attends over all previous outputs. (c) Block AttnRes: layers are grouped into N blocks, reducing memory from O(Ld) to O(Nd).

This is the official repository for Attention Residuals (AttnRes), a drop-in replacement for standard residual connections in Transformers that enables each layer to selectively aggregate earlier representations via learned, input-dependent attention over depth.

Overview

Standard residual connections accumulate all layer outputs with fixed unit weights. As depth grows, this uniform aggregation dilutes each individual layer's contribution and lets hidden-state magnitudes grow without bound, a well-known issue in PreNorm Transformers.
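To make the growth concrete, here is a minimal NumPy sketch (the sublayer is a normalize-and-project stand-in, not the actual architecture) showing that fixed unit-weight accumulation, h_l = h_{l-1} + f(h_{l-1}), makes the hidden-state norm grow steadily with depth while each new layer's relative contribution shrinks:

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 64, 48  # hidden width, depth

def sublayer(x):
    # Stand-in for a PreNorm Transformer sublayer: normalize the input,
    # then apply a random projection so outputs have roughly unit-scale entries.
    z = (x - x.mean()) / (x.std() + 1e-5)
    return z @ rng.normal(size=(d, d)) / np.sqrt(d)

h = rng.normal(size=d)
norms = []
for _ in range(L):
    h = h + sublayer(h)  # fixed unit-weight residual accumulation
    norms.append(np.linalg.norm(h))

# The residual-stream norm keeps growing, so each individual layer's
# roughly constant-norm output is an ever-smaller fraction of the stream.
print(norms[0], norms[-1])
```

Since the per-layer outputs are roughly orthogonal, the stream norm grows on the order of the square root of depth, which is exactly the dilution AttnRes is designed to counter.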

AttnRes replaces this fixed accumulation with softmax attention over preceding layer outputs:
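The exact update rule is truncated in this excerpt, so the following NumPy sketch only illustrates the stated idea under assumptions: each layer's output forms an input-dependent query, learned key projections score the stack of all earlier outputs, and a softmax over depth replaces the fixed unit weights. The names `Wq`, `Wk`, and the dot-product scoring are illustrative choices, not the paper's formulation.

```python
import numpy as np

rng = np.random.default_rng(1)
d, L = 64, 12  # hidden width, depth

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def sublayer(x, W):
    # Stand-in sublayer: normalize, then project.
    z = (x - x.mean()) / (x.std() + 1e-5)
    return z @ W / np.sqrt(d)

# "Learned" projections, randomly initialized here for illustration.
Wq = rng.normal(size=(d, d)) / np.sqrt(d)
Wk = rng.normal(size=(d, d)) / np.sqrt(d)
Ws = [rng.normal(size=(d, d)) for _ in range(L)]

x = rng.normal(size=d)
outputs = [x]      # depth-wise memory of all layer outputs: O(Ld) storage
h = x
for l in range(L):
    f = sublayer(h, Ws[l])
    outputs.append(f)
    past = np.stack(outputs)          # (l+2, d): embedding + outputs so far
    q = f @ Wq                        # input-dependent query from the new output
    scores = (past @ Wk) @ q / np.sqrt(d)
    w = softmax(scores)               # attention over depth replaces unit weights
    h = w @ past                      # weighted combination instead of plain sum

print(h.shape)  # (64,)
```

Note that storing every layer output costs O(Ld) memory per token, which is what motivates the Block AttnRes variant's O(Nd) grouping described above.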

... continue reading