
From multi-head to latent attention: The evolution of attention mechanisms


Vinithavn · 7 min read


What is attention?

In any autoregressive model, the prediction of future tokens is based on the preceding context. However, not all tokens within this context contribute equally to the prediction, because some tokens are more relevant than others. The attention mechanism addresses this by allowing the model to selectively concentrate on the important context words while generating each output token. Consider a popular example that illustrates the attention mechanism:

“The animal didn’t cross the street because it was too tired”.

In this sentence, the pronoun “it” could refer to either “animal” or “street”. Attention helps the model associate “it” with “animal” rather than “street” by weighing the relative importance of each word. This ability to weigh relationships between words is what lets the model capture contextual meaning in various NLP tasks.
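To make that weighting concrete, here is a minimal sketch of the idea. The embeddings below are invented toy vectors, not the output of any trained model, and the weights are computed with the standard scaled dot-product followed by a softmax.

```python
import numpy as np

# Hypothetical 4-dimensional embeddings for three words in the example sentence;
# the numbers are made up purely to illustrate the idea.
embeddings = {
    "it":     np.array([0.9, 0.1, 0.4, 0.2]),
    "animal": np.array([0.8, 0.2, 0.5, 0.1]),  # deliberately similar to "it"
    "street": np.array([0.1, 0.9, 0.0, 0.7]),  # deliberately dissimilar
}

def attention_weights(query, keys):
    """Scaled dot-product scores followed by a softmax."""
    d_k = query.shape[-1]
    scores = np.array([query @ k for k in keys]) / np.sqrt(d_k)
    exp = np.exp(scores - scores.max())          # numerically stable softmax
    return exp / exp.sum()

weights = attention_weights(embeddings["it"],
                            [embeddings["animal"], embeddings["street"]])
print(dict(zip(["animal", "street"], weights.round(3))))
# With these toy vectors, "animal" receives the larger weight,
# mirroring how attention resolves the pronoun "it".
```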

How is attention calculated?

There are various types of attention mechanisms today, beginning with Multi-Head Attention (MHA), introduced in the seminal “Attention Is All You Need” paper. More recently, advanced variants like Multi-Head Latent Attention (MLA) have been employed in popular models such as DeepSeek. This blog covers the fundamentals of each attention mechanism, including the core ideas, advantages, and limitations.
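As a rough illustration of what “multi-head” means in practice, the sketch below implements scaled dot-product attention and splits the model dimension across several heads. The dimensions and random projection matrices are arbitrary placeholders, not the configuration of any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)

def scaled_dot_product_attention(Q, K, V):
    """softmax(QK^T / sqrt(d_k)) V for a single head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def multi_head_attention(x, num_heads, W_q, W_k, W_v, W_o):
    """Project x into per-head Q, K, V, attend within each head, then recombine."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    heads = []
    for h in range(num_heads):
        sl = slice(h * d_head, (h + 1) * d_head)
        heads.append(scaled_dot_product_attention(Q[:, sl], K[:, sl], V[:, sl]))
    return np.concatenate(heads, axis=-1) @ W_o

# Toy usage: 5 tokens, model dimension 8, 2 heads, random weights.
d_model, num_heads = 8, 2
x = rng.normal(size=(5, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(x, num_heads, W_q, W_k, W_v, W_o)
print(out.shape)  # (5, 8)
```

Each head attends over the same token sequence but in its own lower-dimensional subspace, which is what allows different heads to specialize in different relationships.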

Key Concepts in Attention Mechanisms

Before diving into specific types of attention, we need to understand some fundamental concepts that underpin all the various attention mechanisms.
