We discovered why language models catastrophically fail on long conversations: when old tokens are removed to save memory, models produce complete gibberish. We found models dump massive attention onto the first few tokens as "attention sinks"—places to park unused attention since softmax requires weights to sum to 1. Our solution, StreamingLLM, simply keeps these first 4 tokens permanently while sliding the window for everything else, enabling stable processing of 4 million+ tokens instead of just thousands. This mechanism is now in HuggingFace Transformers, NVIDIA TensorRT-LLM, and OpenAI's latest models.

This week, OpenAI made headlines by releasing GPT-OSS-20B and GPT-OSS-120B, their first open-weight large language models since GPT-2. Buried in the technical documentation was a fascinating architectural detail: the inclusion of attention sink mechanisms. Their implementation adds a trainable scalar value to each attention head's softmax calculation. In essence, each head learns a scalar s that joins the usual attention scores in the softmax denominator:

    a_i = exp(q·k_i) / ( exp(s) + Σ_j exp(q·k_j) )

This simple modification—adding just one learnable parameter per attention head—enables the model to "pay no attention to any tokens" when needed, a design choice OpenAI's model card explicitly attributes to our StreamingLLM work.

OpenAI's model card for GPT-OSS-20B explains the attention sink mechanism, directly connecting the design to our research.

Seeing this feature in a major OpenAI release connected directly to research that began during my internship at Meta in the summer of 2023, when I was tasked with solving what seemed like a simple problem: How do you make a language model handle conversations longer than what it was trained for?

This is the story of attention sinks: how we discovered this mechanism that every Transformer relies on, why it's crucial for model stability, and how this research has found its way into production AI systems.

The Streaming Challenge

At the beginning of summer 2023, I was presented with a fundamental question: How can we make a language model handle conversations longer than what it was trained for?

The challenge affects any real-world AI application. Consider a chatbot engaged in an ongoing conversation—it needs to be aware of the recent context, but it can't afford to re-process the entire conversation history with every new word it generates. The computational cost would grow quadratically, making long conversations prohibitively expensive.

Dense attention has a quadratic time complexity and an increasing cache size. Its performance decreases when the text length exceeds the pre-training text length.

The obvious solution seemed to be a sliding window approach: keep a fixed-size cache of the most recent tokens' internal states (their Key and Value vectors, known as the KV cache) and simply drop the oldest ones as the conversation grows. This approach is efficient and—as we discovered—fails spectacularly.

Window Attention caches the most recent tokens' KV. While efficient in inference, performance declines sharply once the starting tokens' keys and values are evicted.
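To make that eviction policy concrete, here is a minimal sketch of window attention's cache management. The helper name and the plain-Python-list cache are illustrative only; real implementations evict entries from per-layer key/value tensors, but the eviction policy is the same (the StreamingLLM fix shown later in the post uses the same toy representation):

    # Window attention: keep only the most recent tokens' KV entries.
    def get_window_kv_cache(full_history, window_size=1024):
        # Once the history outgrows the window, the oldest entries are evicted,
        # including the very first tokens of the conversation.
        return full_history[-window_size:]

    cache = get_window_kv_cache(list(range(2000)))  # entries 0-975 are gone, the initial tokens included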
Our experiments revealed a stark and unexpected result: the moment we removed the very first few tokens from the cache, the model's performance collapsed entirely. The perplexity—a measure of how poorly the model predicts the actual next tokens, where lower is better—skyrocketed. Models that had been generating perfectly coherent text suddenly began producing complete nonsense.

The model's perplexity skyrockets the moment the initial tokens are evicted from the cache, indicating catastrophic failure.

This was puzzling. The first few tokens often carried minimal semantic information—sometimes just a start-of-sequence marker or common words like "the" or "a." Why would removing these seemingly insignificant tokens cause such catastrophic failure?

The Discovery: When Attention Gets into Sinks

When I began visualizing the attention patterns inside models like Llama-2, a consistent and unexpected pattern emerged. Across most layers, large amounts of attention were being directed toward the very first tokens in the sequence. Regardless of what the current token was trying to predict, it would often "look back" toward the beginning.

Attention visualization showing a heavy focus on the first tokens across multiple layers in LLaMA-2.

The behavior reminded me of graph theory from my undergraduate studies. In directed graphs, a sink node is defined as a vertex with no outgoing connections—it receives flow from other nodes but doesn't pass it along. These initial tokens were behaving similarly: they were absorbing attention from across the sequence without contributing much in return. This analogy led me to term them attention sinks.

A visual representation of a sink in a directed graph. (Source: mathworld.wolfram.com)

The Mathematical Foundation

The mathematics behind this behavior reveals a fundamental constraint of the Transformer architecture. In the attention mechanism, we compute attention weights using the softmax function:

    softmax(x)_i = exp(x_i) / Σ_j exp(x_j)

The softmax function forces all attention weights to sum to exactly 1.0: whatever the scores x_j are, the weights softmax(x)_i always add up to 1.

This creates what Evan Miller eloquently described as a "deafening democracy where abstention is disallowed." Every attention head must allocate its focus somewhere, even when it has no meaningful information to contribute. When a token encounters no particularly relevant context to attend to, where does that mandatory attention "budget" go? In practice, the initial tokens are visible to every later token under causal attention, so they take part in virtually every prediction the model learns to make. This small positional bias gets amplified through training, causing these positions to evolve into specialized repositories for otherwise unused attention—functioning as computational pressure valves.

Not a New Phenomenon

This phenomenon wasn't entirely new. Researchers had observed similar patterns in BERT, where "a surprisingly large amount of attention focuses on the delimiter token [SEP] and periods," which they argued was used by the model as a sort of no-op. The same summer at Meta, researchers studying vision transformers found similar behavior, observing that models would repurpose uninformative background patches as computational scratchpads.

Researchers observed similar sink-like behavior in both NLP models like BERT and vision models.

The Revelation

Our discovery suddenly illuminated why the sliding window approach failed so catastrophically. When we evicted those initial tokens, we weren't simply removing old context—we were removing a huge proportion of the denominator in the softmax function, dismantling the model's fundamental mechanism for maintaining attention stability. We had inadvertently removed its pressure valve.

Illustration of why sliding window approaches fail when attention sinks are removed.
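A toy calculation makes the "removed denominator" point concrete. The attention scores below are made up purely for illustration, but they mimic the pattern we saw: one query gives the first token a far higher score than the handful of content tokens.

    import numpy as np

    def softmax(x):
        e = np.exp(x - np.max(x))
        return e / e.sum()

    # Made-up scores for one query: a large score on the sink (first) token,
    # small, similar scores on four content tokens.
    scores = np.array([4.0, 0.3, 0.1, 0.2, 0.0])

    print(softmax(scores).round(3))      # [0.921 0.023 0.019 0.021 0.017] -> ~92% parked on the sink
    print(softmax(scores[1:]).round(3))  # [0.289 0.236 0.261 0.214]       -> evict the sink and that
                                         #    mass floods onto the content tokens

With the sink in the denominator, over 90% of this query's attention has somewhere harmless to go; remove it and the softmax forcibly redistributes that attention onto content tokens the model never meant to emphasize, which is exactly the destabilization we observed.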
The Simple Fix: Just Don't Throw Away the Sinks

Understanding the problem led us to an almost embarrassingly simple solution: if these already-trained models desperately need attention sinks, why not just never throw them away?

Our work, StreamingLLM, introduced a surprisingly straightforward modification to standard KV cache management. Instead of discarding the first few tokens like any others, we permanently preserve their Key and Value states while maintaining a sliding window for everything else.

Diagram comparing a standard sliding window to the StreamingLLM approach, where the first few "sink" tokens are always kept in the cache.

The implementation couldn't be simpler:

    # StreamingLLM: Always keep the first few tokens as attention sinks
    def get_streaming_kv_cache(full_history, window_size=1020, sink_size=4):
        # Nothing has been evicted yet: the whole history still fits in the cache
        if len(full_history) <= window_size:
            return list(full_history)
        # Never discard the first few tokens - they're our attention sinks
        sink_tokens = full_history[:sink_size]
        # Maintain a sliding window for the actual content
        recent_tokens = full_history[-(window_size - sink_size):]
        # Combine permanent sinks with recent content
        return sink_tokens + recent_tokens

The results were remarkable. Models like LLaMA that previously collapsed after a few thousand tokens could now maintain stable perplexity across 4 million tokens—processing sequences orders of magnitude longer than their original training context. We had unlocked infinite-length generation simply by respecting the model's existing attention patterns.

The perplexity of StreamingLLM remains low and stable across 4 million tokens, while other methods fail.

But Why Four Sinks? The Pre-Training Question

This success raised an intriguing follow-up question: why did we need to preserve four attention sink tokens? Could we get away with just one? This curiosity led us to our pre-training experiments. We trained two identical 160-million-parameter models from scratch—one with standard training, and another that included a dedicated [SINK] token at the start of every training sample.

Adding a sink token during pre-training benefits the model's convergence trend and zero-shot performance across several NLP benchmarks.

The results revealed something profound: the model trained with a dedicated sink token needed only one attention sink during streaming, while the vanilla model required four repurposed content tokens to maintain stability. Moreover, the model with the purpose-built sink actually converged slightly better during training.

Training without dedicated sinks, or with SoftMax-off-by-one (Zero Sink), still requires multiple initial tokens to stabilize performance—indicating multiple implicit sinks—whereas a single learnable sink token alone is sufficient.

This experiment showed that models can learn to use purpose-built attention sinks more efficiently than hijacking existing content tokens—a finding that would influence how attention mechanisms are designed from the ground up.

Two Paths to the Same Solution: Our Design vs. OpenAI's

Interestingly, OpenAI's recent implementation represents a different approach to solving the same problem. Where we use a dedicated, learnable sink token at the sequence start, whose key k_0 is scored by each query q just like any other token's:

    a_i = exp(q·k_i) / ( exp(q·k_0) + Σ_{j≥1} exp(q·k_j) )

OpenAI simplified this with a universal scalar approach: each head learns a single scalar s that enters the denominator directly, independent of the query:

    a_i = exp(q·k_i) / ( exp(s) + Σ_j exp(q·k_j) )

The key difference: our approach lets different tokens have different relationships with the sink (some tokens might "need" the sink more than others), while OpenAI's treats the sink as a simple escape valve—the same for all tokens. Their design trades expressiveness for simplicity, eliminating the need for sink token embeddings while capturing the core insight that models need somewhere to dump unused attention.
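To make the contrast concrete, here is a rough numerical sketch of the two formulas above. The function names and toy vectors are invented, the usual 1/√d scaling and causal masking are omitted, and in real models the sink key and the scalar s are trained parameters rather than values picked by hand:

    import numpy as np

    def sink_token_attention(q, keys, k_sink):
        # StreamingLLM-style: the sink is a real token with its own key,
        # so each query can give it a different score.
        logits = np.concatenate(([q @ k_sink], keys @ q))
        w = np.exp(logits - logits.max())
        w /= w.sum()
        return w[0], w[1:]  # (attention absorbed by the sink, attention on content tokens)

    def scalar_sink_attention(q, keys, s):
        # GPT-OSS-style: one learned scalar logit per head, identical for every query.
        logits = np.concatenate(([s], keys @ q))
        w = np.exp(logits - logits.max())
        w /= w.sum()
        return w[0], w[1:]

    rng = np.random.default_rng(0)
    q, k_sink = rng.normal(size=64), rng.normal(size=64)
    keys = rng.normal(size=(8, 64))
    print(sink_token_attention(q, keys, k_sink)[0])  # sink share depends on this particular query
    print(scalar_sink_attention(q, keys, 2.0)[0])    # sink share set by s, the same offset for every query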
Both approaches solve the fundamental problem, but they represent different philosophies: we treated the sink as a learnable token that lives in the input sequence, while they built it directly into the attention operator as an architectural necessity.

The Science Behind the Sink: Recent Discoveries

Since our work, researchers have deepened our understanding of why attention sinks emerge and how they function.

Barbero et al. have shown that attention sinks serve as "pressure valves" preventing what researchers call "over-mixing"—a pathological state in which deep models processing long sequences blur important distinctions between tokens. The presence of a sink draws attention away from other tokens, limiting the spread of information (and noise) and resulting in more stable embeddings. This effect becomes more pronounced in larger models; LLaMA 3.1 405B shows attention sinks in a remarkable 80% of its attention heads.

Research shows sinks slow the mixing of information, making Transformers more robust. A perturbation (red) spreads less with a sink (right) than without (left). (Source: Barbero et al.)

Gu et al. have traced sink formation to the fundamental constraint of softmax normalization. As we noted, when attention weights must sum to 1.0, models need a default place to allocate their attention budget. Tellingly, replacing softmax with attention operations that don't have this constraint prevents sinks from forming entirely.

Table from Gu et al. showing how KV biases can prevent sink formation.

Practical Applications

These insights have inspired practical applications. Sun et al. directly introduced learnable key and value parameters (KV biases) into the attention mechanism, finding that this design could alleviate the massive activations seen during inference—essentially the same approach as our learnable sink experiments. Building on this understanding, "CushionCache" by Son et al. uses deliberately designed attention sink prefixes that improve model quantization by reducing activation outliers.

Diagram from Sun et al. showing the architectural addition of learnable KV biases. Techniques like CushionCache (Son et al.) use sinks to tame activation spikes, improving quantization.
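As a rough sketch of that KV-bias idea (not Sun et al.'s exact formulation; the module and parameter names here are invented, and multi-head structure and causal masking are left out), a single attention head can carry one always-present learnable key/value pair that acts as a built-in sink:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class HeadWithKVBias(nn.Module):
        """One attention head with a learnable key/value pair serving as a built-in sink."""

        def __init__(self, dim: int):
            super().__init__()
            self.q = nn.Linear(dim, dim, bias=False)
            self.k = nn.Linear(dim, dim, bias=False)
            self.v = nn.Linear(dim, dim, bias=False)
            # The "KV bias": an extra key/value pair that every query can attend to,
            # giving the head a place to park attention that carries no content.
            self.k_sink = nn.Parameter(torch.zeros(1, 1, dim))
            self.v_sink = nn.Parameter(torch.zeros(1, 1, dim))

        def forward(self, x):  # x: (batch, seq, dim)
            q, k, v = self.q(x), self.k(x), self.v(x)
            b = x.shape[0]
            # Prepend the learnable sink pair to the keys and values of the real tokens.
            k = torch.cat([self.k_sink.expand(b, -1, -1), k], dim=1)
            v = torch.cat([self.v_sink.expand(b, -1, -1), v], dim=1)
            scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
            return F.softmax(scores, dim=-1) @ v

Because the sink pair is a parameter of the layer rather than a token in the prompt, it is always present, never gets evicted from the KV cache, and costs no sequence position; this is the kind of design the work above credits with alleviating massive activations.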
Given this connection between attention sinks and quantization stability, it's intriguing to speculate that OpenAI's built-in attention sink mechanism may partly enable the aggressive 4-bit weight quantization in their open-weight models, by helping to prevent the activation outliers that typically plague extreme quantization.

From Research to Reality

What began as a practical engineering problem during my internship has evolved into a fundamental insight about Transformer architecture. The attention sink mechanism we discovered has since been adopted across the industry, appearing in production systems like OpenAI's models and inspiring new research directions in quantization and model optimization.

The adoption happened remarkably quickly. By October 2023, Intel had integrated StreamingLLM into their Extension for Transformers, enabling continuous LLM inference on CPUs with just 3 lines of code. December 2023 saw explosive adoption: HuggingFace integrated attention sinks into their main Transformers branch, and just days later, researchers from CMU, UW, and OctoAI demonstrated endless LLM generation running directly on iPhones, noting that "attention sinks are particularly helpful for longer generation with less memory requirement."

The momentum continued into 2024, with NVIDIA incorporating StreamingLLM into TensorRT-LLM in January. Now, in August 2025, OpenAI released their open-weight models with built-in attention sink parameters, bringing the mechanism full circle from research discovery to production implementation.

Sometimes an impactful discovery emerges not from grand theoretical breakthroughs, but from carefully investigating the curious details that others might overlook. In our case, questioning why a few seemingly meaningless tokens were so critical led us to uncover a mechanism that every Transformer model relies on—one that was hiding in plain sight.

References