KV Sharing, mHC, and Compressed Attention

Why This Matters

Recent LLM architectures focus on long-context efficiency through techniques such as KV sharing, compressed attention, and layer-wise attention budgeting. These design changes aim to reduce memory and compute costs so that models can process longer contexts, which matters as reasoning models and agent workflows keep more tokens in context.

Key Takeaways

After a short family break, I am excited to be back and catching up on a busy few weeks of open-weight LLM releases. The thing that stood out to me is how much newer architectures are focused on long-context efficiency.

As reasoning models and agent workflows keep more tokens around (for longer), KV-cache size, memory traffic, and attention cost quickly become the main constraints, and LLM developers are adding a growing number of architecture tricks to reduce those costs.
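To make the KV-cache constraint concrete, here is a rough back-of-the-envelope sketch of how cache size grows with context length for standard, uncompressed attention. The function name and the configuration values (32 layers, 8 KV heads, head dimension 128, 2 bytes per value) are hypothetical and chosen for illustration only; they do not describe any of the models discussed in this article.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Bytes needed to cache keys and values for plain, uncompressed attention."""
    # The factor 2 accounts for storing both a key and a value per token,
    # per KV head, per layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_value * batch_size)


# Hypothetical configuration (an assumption, not a real model's settings).
cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128)

for seq_len in (8_192, 131_072):
    gib = kv_cache_bytes(seq_len=seq_len, **cfg) / 2**30
    print(f"{seq_len:>7} tokens: {gib:.2f} GiB")
```

Under these assumptions, the cache grows linearly with the number of tokens, and every factor in the product is a lever: fewer layers that store their own keys and values, fewer KV heads, or a smaller per-token representation each shrink the cache proportionally.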

The main examples I want to look at are KV sharing and per-layer embeddings in Gemma 4, layer-wise attention budgeting in Laguna XS.2, compressed convolutional attention in ZAYA1-8B, and mHC plus compressed attention in DeepSeek V4.

Most of these changes look like small tweaks in my architecture diagrams, but some of them are quite intricate design changes that are worth a more detailed discussion.

Figure 1. LLM architecture drawings of recent, major open-weight releases (April to May). You can find the images, and more details, in my LLM architecture gallery. Not all model sizes are shown; Qwen3.6 includes the 27B and 35B-A3B variants, and ZAYA1 is represented by the 8B model (omitting ZAYA1-base and ZAYA1-reasoning-base). The architectures in the dotted boxes are covered in more detail in this article.

Note that this article is about architecture designs, so I will mostly skip dataset mixtures, training schedules, post-training details, RL recipes, benchmark tables, and product comparisons. Even with that narrower scope, there is a lot to cover. And, as always, the article turned out longer than I expected, so I will keep the focus on what changes inside the transformer block, residual stream, KV cache, or attention computation.

Please also note that I am only covering design choices that are new and interesting and that I have not yet covered elsewhere. This list includes:

KV sharing and per-layer embeddings in Gemma 4
Compressed convolutional attention in ZAYA1
Attention budgeting in Laguna XS.2
mHC and compressed attention in DeepSeek V4

Previous Topics

Before getting into the new parts, here are the two previous articles I will refer back to. The first one gives a broader architecture background on recent MoE models, routed experts, active parameters, and model-size comparisons. The second one covers the attention background that comes up repeatedly below, including MHA, MQA, GQA, MLA, sliding-window attention, sparse attention, and hybrid attention designs.
