
MSA: Memory Sparse Attention

Why This Matters

Memory Sparse Attention (MSA) introduces a scalable, end-to-end trainable framework that significantly extends the effective context length of large language models to 100 million tokens. By leveraging sparse attention and efficient memory management, MSA overcomes traditional bottlenecks, enabling more accurate and context-aware AI applications at unprecedented scales. This advancement has the potential to transform long-term reasoning, knowledge retention, and complex task handling in AI systems, benefiting both the industry and consumers.

Key Takeaways

MSA: Memory Sparse Attention

A scalable, end-to-end trainable latent-memory framework for 100M-token contexts

Paper • Code (coming soon) • Models (coming soon)

📝 Abstract

Long-term memory is essential for general intelligence, yet the full-attention bottleneck constrains most LLMs' effective context length to 128K–1M tokens. Existing attempts (hybrid linear attention, fixed-size state memory such as RNNs, and external storage such as RAG or agent pipelines) either suffer rapid precision decay and latency growth at extreme scales, lack end-to-end differentiability or dynamic memory maintenance, or require complex pipelines. We present Memory Sparse Attention (MSA): an end-to-end trainable, scalable sparse latent-state memory framework. Core ideas include:

Scalable sparse attention + document-wise RoPE (parallel/global), achieving near-linear complexity in both training and inference;

KV cache compression with a Memory Parallel inference engine, delivering 100M-token throughput on 2×A800 GPUs;

Memory Interleave for multi-round, multi-hop reasoning across scattered memory segments.
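To make the sparse-attention idea concrete, here is a minimal NumPy sketch of per-query top-k attention: score all keys, keep only the k highest-scoring ones, and softmax over that subset. The function name and the plain softmax-over-selected-keys formulation are illustrative assumptions, not MSA's actual kernel (which additionally keeps the selection step trainable end-to-end).

```python
import numpy as np

def topk_sparse_attention(q, K, V, k=4):
    """Attend over only the top-k keys by dot-product score (illustrative).

    q: (d,) query; K: (n, d) keys; V: (n, d) values.
    The softmax over the selected scores stays differentiable w.r.t. the
    chosen entries; making the selection itself trainable needs an extra
    relaxation, which this sketch omits.
    """
    scores = K @ q / np.sqrt(q.shape[-1])        # (n,) scaled dot products
    idx = np.argpartition(-scores, k - 1)[:k]    # indices of the k best scores
    sel = scores[idx]
    w = np.exp(sel - sel.max())
    w /= w.sum()                                 # softmax over selected keys only
    return w @ V[idx]                            # (d,) weighted mix of k values
```

With k = n this reduces to ordinary full softmax attention; the savings come from attending over k ≪ n memory entries per query.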

On long-context QA and NIAH (Needle-in-a-Haystack) benchmarks, MSA surpasses same-backbone RAG, best-of-breed RAG stacks, and leading long-context models. Across an unprecedented 16K→100M token range, MSA shows < 9% degradation, suggesting a practical path to decouple memory capacity from reasoning.

Scaling from 16K→100M tokens: MSA fuses top-k selection with sparse attention to remain end-to-end differentiable while allowing document decoupling at inference. On MS MARCO, MSA sustains <9% degradation and exhibits strong extrapolation.
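The summary does not spell out "document-wise RoPE". One plausible reading, sketched below under that assumption, is that rotary position ids restart at zero for each document, so documents can be encoded in parallel and decoupled from their absolute offset in the 100M-token stream. `rope` is standard rotary embedding; `document_wise_positions` is an illustrative helper name.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Standard rotary position embedding on the last dim (must be even)."""
    d = x.shape[-1]
    inv = base ** (-np.arange(0, d, 2) / d)   # (d/2,) inverse frequencies
    ang = pos[:, None] * inv[None, :]         # (n, d/2) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin        # rotate each (even, odd) pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def document_wise_positions(doc_lengths):
    """Position ids that restart at 0 for each document, so each document's
    encoding is independent of where it sits in the overall token stream."""
    return np.concatenate([np.arange(n) for n in doc_lengths])
```

Because each pair of channels is rotated, RoPE preserves vector norms, and restarting positions per document means the same document produces the same keys wherever it appears in memory.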
