SepLLM: Accelerate LLMs by Compressing One Segment into One Separator
Published on: 2025-06-29 09:27:26
Large Language Models (LLMs) have exhibited exceptional performance across a spectrum of natural language processing tasks. However, their substantial sizes pose considerable challenges, particularly in terms of computational demands and inference speed, due to the quadratic complexity of self-attention. In this work, we identify a noteworthy pattern: certain seemingly meaningless special tokens (i.e., separators) contribute disproportionately to attention scores compared to semantically meaningful tokens. This observation leads us to hypothesize that the information of the segment between these separator tokens can be condensed into the separators themselves without significant loss of information. Based on this hypothesis, we introduce SepLLM, a plug-and-play framework that accelerates inference by compressing these segments and dropping redundant tokens. In addition, we implement efficient kernels for training acceleration. Experimental results in training-free, training-from-scratch, and post-training settings substantiate the effectiveness of SepLLM.
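To make the core idea concrete, here is a minimal, hypothetical sketch (not the paper's implementation) of how a SepLLM-style sparse attention mask could be built in PyTorch for the training-free setting: each query attends only to a few initial "sink" tokens, all separator tokens seen so far, and its most recent neighbors, so older intra-segment tokens can be dropped on the assumption that their information has been condensed into the trailing separator. The function name and the parameters `n_init` and `n_local` are illustrative assumptions, not the paper's API.

```python
import torch

def sepllm_style_mask(token_ids: torch.Tensor, sep_ids: torch.Tensor,
                      n_init: int = 4, n_local: int = 64) -> torch.Tensor:
    """Hypothetical sketch of a separator-retaining causal attention mask.

    Each query position may attend to:
      (a) the first `n_init` initial ("sink") tokens,
      (b) any separator token at or before its position,
      (c) the `n_local` most recent tokens.
    Everything else is masked out, approximating dropping those keys/values.
    """
    T = token_ids.shape[0]
    pos = torch.arange(T)
    causal = pos[None, :] <= pos[:, None]                 # (T, T) causal constraint
    is_init = pos[None, :] < n_init                       # initial/sink tokens
    is_sep = torch.isin(token_ids, sep_ids)[None, :]      # separator tokens
    is_local = (pos[:, None] - pos[None, :]) < n_local    # recent neighbors
    return causal & (is_init | is_sep | is_local)         # True = attention allowed

# Example usage with made-up token ids, where ids 13 and 11 stand for ".", ","
tokens = torch.randint(0, 1000, (256,))
mask = sepllm_style_mask(tokens, sep_ids=torch.tensor([13, 11]), n_init=4, n_local=32)
print(mask.shape, mask.float().mean())  # fraction of positions still attended to
```

A mask like this can be passed to a standard scaled-dot-product attention call in place of the dense causal mask; the reported KV-cache savings would come from additionally evicting the masked-out keys and values rather than merely masking them.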