
Kimi Linear: An Expressive, Efficient Attention Architecture


Figure: (a) On MMLU-Pro (4k context length), Kimi Linear scores 51.0 at a speed similar to full attention. On RULER (128k context length), it is Pareto-optimal at 84.3 with a 3.98x speedup. (b) Kimi Linear achieves up to 6.3x faster TPOT (time per output token) than MLA, a significant speedup at long sequence lengths (1M tokens).

Overview

Kimi Linear is a hybrid linear attention architecture that outperforms traditional full attention across short-context, long-context, and reinforcement learning (RL) scaling regimes. At its core is Kimi Delta Attention (KDA), a refinement of Gated DeltaNet that introduces a more efficient gating mechanism to better manage the finite-state RNN memory.
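Conceptually, a delta-rule attention layer keeps a fixed-size state matrix that is decayed and then corrected toward each new key-value pair, and "fine-grained" gating means the decay is applied per channel rather than as a single scalar. The snippet below is a minimal single-head sketch of that idea in PyTorch; it is only an illustration of the gated delta rule with channel-wise gating, not the open-sourced FLA kernel, and KDA's exact gate parameterization may differ.

```python
import torch

def kda_step(S, k, v, alpha, beta):
    """One recurrent step of a gated delta rule with fine-grained gating (toy sketch).

    S     : (d_k, d_v) fixed-size recurrent state ("fast weight" matrix)
    k, v  : (d_k,), (d_v,) current key and value
    alpha : (d_k,) per-channel decay gate in (0, 1) -- the "fine-grained" part
    beta  : scalar update (learning-rate) gate in (0, 1)
    """
    S = alpha.unsqueeze(-1) * S                  # channel-wise forgetting
    pred = k @ S                                 # value currently stored under k
    S = S + beta * torch.outer(k, v - pred)      # delta-rule correction toward v
    return S

def kda_readout(S, q):
    """Read the state with a query; returns a (d_v,) output."""
    return q @ S

# Example: run a few steps on random inputs.
d_k, d_v = 8, 8
S = torch.zeros(d_k, d_v)
for _ in range(4):
    k, v, q = torch.randn(d_k), torch.randn(d_v), torch.randn(d_k)
    alpha = torch.sigmoid(torch.randn(d_k))      # per-channel gate
    beta = torch.sigmoid(torch.randn(()))        # scalar update gate
    S = kda_step(S, k, v, alpha, beta)
    out = kda_readout(S, q)
```

Because the state S has a fixed shape regardless of sequence length, such a layer needs no per-token KV cache during decoding.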

Kimi Linear achieves superior performance and hardware efficiency, especially on long-context tasks. It reduces KV cache requirements by up to 75% and boosts decoding throughput by up to 6x for contexts as long as 1M tokens.
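As a rough back-of-the-envelope check (with hypothetical layer counts and per-token cache sizes, not the released model's configuration): if only the global-attention layers in a 3:1 hybrid keep a cache that grows with sequence length, while the KDA layers keep a constant-size recurrent state, the growing cache shrinks by about 75%.

```python
def kv_cache_entries(seq_len, n_layers, per_token_cache, kda_to_mla_ratio=None):
    """Rough count of cached elements that grow with sequence length.

    per_token_cache : elements a full-attention/MLA layer caches per token.
    kda_to_mla_ratio: if set (e.g. 3), only 1 of every (ratio + 1) layers keeps
    a growing cache; KDA layers hold a fixed-size state, ignored here because
    it does not grow with seq_len.  All numbers are illustrative only.
    """
    if kda_to_mla_ratio is None:
        caching_layers = n_layers
    else:
        caching_layers = n_layers // (kda_to_mla_ratio + 1)
    return seq_len * caching_layers * per_token_cache

# Hypothetical config: 32 layers, 512 cached elements per token per layer, 1M tokens.
full = kv_cache_entries(1_000_000, 32, 512)
hybrid = kv_cache_entries(1_000_000, 32, 512, kda_to_mla_ratio=3)
print(f"reduction: {1 - hybrid / full:.0%}")   # -> 75%
```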

We open-sourced the KDA kernel in FLA and released two model checkpoints trained on 5.7T tokens.

Kimi-Linear-Base: 48B total params, 3B activated params, 1M context length (🤗 Hugging Face)
Kimi-Linear-Instruct: 48B total params, 3B activated params, 1M context length (🤗 Hugging Face)

Key Features

Kimi Delta Attention (KDA): A linear attention mechanism that refines the gated delta rule with fine-grained gating.

Hybrid Architecture: A 3:1 KDA-to-global-MLA layer ratio reduces memory usage while maintaining or surpassing full-attention quality (see the sketch after this list).

Superior Performance: Outperforms full attention across a variety of tasks, including long-context and RL-style benchmarks, in fair comparisons on 1.4T-token training runs.
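To make the 3:1 interleaving concrete, here is a hedged sketch of how such a hybrid stack could be assembled. KDALayer and MLALayer are hypothetical placeholders (a residual linear block and a standard multi-head attention block), not the released Kimi blocks; only the interleaving pattern is the point.

```python
import torch
import torch.nn as nn

class KDALayer(nn.Module):
    """Placeholder for a Kimi Delta Attention block (fixed-size recurrent state)."""
    def __init__(self, d_model):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)
    def forward(self, x):
        return x + self.proj(x)          # stand-in; the real block runs the KDA recurrence

class MLALayer(nn.Module):
    """Placeholder for a global attention block (KV cache grows with sequence length)."""
    def __init__(self, d_model):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
    def forward(self, x):
        out, _ = self.attn(x, x, x)
        return x + out

def build_hybrid_stack(n_blocks=8, d_model=64, ratio=3):
    """Interleave KDA and global layers at a ratio:1 pattern, e.g. K K K M K K K M."""
    layers = []
    for i in range(n_blocks):
        if (i + 1) % (ratio + 1) == 0:
            layers.append(MLALayer(d_model))   # every (ratio + 1)-th block is global
        else:
            layers.append(KDALayer(d_model))
    return nn.Sequential(*layers)

stack = build_hybrid_stack()
x = torch.randn(2, 16, 64)                     # (batch, seq, d_model)
y = stack(x)
```

With this layout, only one in every four blocks carries a sequence-length-dependent cache, which is where the memory savings at long contexts come from.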
