ZDNET key takeaways
DeepSeek unveils a new AI model focused on cost efficiency.
The main innovation is a reduction in the compute needed to run attention.
The innovation is not revolutionary; it's evolutionary.
The Chinese artificial intelligence startup DeepSeek AI, which stunned the world in January with claims of dramatic cost efficiency for generative AI, is back with the latest twist on its use of the technology to drive down the price of computing.
Last week, DeepSeek unveiled its latest research, DeepSeek-V3.2-Exp. On its corporate blog, the company claims the new model can cut the cost of making predictions, known as inference, by 75%, from $1.68 per million tokens to 42 cents.
Also: DeepSeek may be about to shake up the AI world again - what we know
As was the case in January, DeepSeek is drawing on techniques in the design of gen AI neural nets, part of the broader deep-learning approach to AI, to squeeze more out of computer chips by exploiting a phenomenon known as "sparsity."
The magic of sparsity
Sparsity is like a magic dial that finds the best match for your AI model and available compute.
Sparsity comes in many forms. Sometimes, it involves eliminating data, or parts of the neural network, that don't materially affect the AI model's output. The payoff follows the same economic rule of thumb that has held for every new generation of personal computers: either a better result for the same money or the same result for less money.
Also: What is sparsity? DeepSeek AI's secret, revealed by Apple researchers
In its earlier work, DeepSeek used the sparsity approach of turning off large sections of neural network "weights" or "parameters" to reduce total computational cost.
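As a rough illustration of what turning off weights can look like, the sketch below applies simple magnitude pruning, zeroing out the smallest parameters so they no longer contribute to the computation. It is a minimal example of the general idea only; DeepSeek's earlier models rely on more sophisticated forms of sparsity, such as activating only a fraction of "expert" sub-networks for each token.

```python
import numpy as np

# Toy weight matrix standing in for one layer of a neural network.
rng = np.random.default_rng(0)
weights = rng.normal(size=(8, 8))

# Magnitude pruning: zero out the 75% of weights with the smallest absolute value.
threshold = np.quantile(np.abs(weights), 0.75)
sparse_weights = np.where(np.abs(weights) >= threshold, weights, 0.0)

print(f"Fraction of weights kept: {np.count_nonzero(sparse_weights) / weights.size:.2f}")
```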
In the new work, as detailed in the technical paper posted on GitHub by DeepSeek researchers, the key is retraining the neural net so that it pays attention to only a subset of the tokens in its context.
Paying better attention
One of the most expensive computing operations in training a neural network for applications, such as chatbots, is what's known as the "attention" mechanism. Attention compares each word you type to prior words, known as the context, and to a vocabulary of words the AI model has in its memory.
The technical term for what you type at the prompt is the "query," and the words to compare to, or stored in memory, are known as "keys." When the attention mechanism finds a match between your query and a stored key, it can select what's called a "value" from the vocabulary to output as the next word or words.
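In code, standard attention boils down to scoring every query against every key and using those scores to mix the values. Here is a minimal sketch in NumPy, purely for illustration and not DeepSeek's implementation:

```python
import numpy as np

def attention(queries, keys, values):
    """Standard scaled dot-product attention over a full context."""
    d = queries.shape[-1]
    # Every query is scored against every key: an (n_queries x n_keys) matrix.
    scores = queries @ keys.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ values                         # weighted mix of the values

# Toy example: 4 context tokens, 16-dimensional embeddings.
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(4, 16)) for _ in range(3))
print(attention(q, k, v).shape)  # (4, 16)
```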
Also: Companies are making the same mistake with AI that Tesla made with robots
The term "word" here is a shorthand for what goes on under the hood. As with all AI models, DeepSeek's program turns words, word fragments, letters, and punctuation into "tokens," which are atomic objects given a numeric value when stored in the tech company's vocabulary.
The attention operation needs to compare a numeric score of the query token to every key token, which it does by matrix multiplication. As the number of tokens the model handles grows -- and as more "context," meaning recent tokens, is employed -- the compute cost grows quadratically.
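A quick back-of-the-envelope count makes the problem concrete: with n tokens of context, every token is scored against every other, so the number of query-key scores is roughly n squared. The context lengths below are arbitrary examples:

```python
# The number of query-key comparisons grows quadratically with context length.
for context_length in (1_000, 10_000, 100_000):
    comparisons = context_length ** 2
    print(f"{context_length:>7} tokens -> {comparisons:>15,} query-key scores")
```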
As an alternative approach, the researchers take the prior version of the AI model, DeepSeek-V3.1, "Terminus," and add what they call a "lightning indexer."
In what is known as a "sparse training" procedure, they train the newly added lightning indexer from scratch alongside the existing V3.1 model. The V3.1 part keeps the normal attention mechanism. The lightning indexer doesn't; instead, it is trained to find a much smaller subset of tokens, the ones most likely to be relevant, from among all the tokens in the context.
Lightning strikes
The point of this approach is that the indexer shrinks the mass of query-key searches at prediction time, restricting them to the selected group, and thereby consumes less compute power each time a prediction needs to be made.
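Conceptually, the indexer is a cheap scoring pass that picks the handful of context tokens worth attending to, after which full attention runs over only that subset. The sketch below shows the general idea; the simple dot-product indexer, the function names, and the top-k of 4 are assumptions for illustration, not DeepSeek's lightning indexer:

```python
import numpy as np

def sparse_attention(queries, keys, values, indexer_q, indexer_k, top_k=4):
    """Illustrative sparse attention: a cheap indexer picks top_k keys per query,
    and full attention runs only over that subset."""
    d = queries.shape[-1]
    # Cheap indexing pass in a small dimension decides which keys matter.
    index_scores = indexer_q @ indexer_k.T           # (n_queries x n_keys), cheap
    outputs = np.zeros_like(queries)
    for i, q in enumerate(queries):
        keep = np.argsort(index_scores[i])[-top_k:]  # indices of the top_k keys
        scores = q @ keys[keep].T / np.sqrt(d)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                     # softmax over the kept keys
        outputs[i] = weights @ values[keep]
    return outputs

# Toy example: 64 context tokens, but each query attends to only 4 of them.
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(64, 16)) for _ in range(3))
iq, ik = (rng.normal(size=(64, 4)) for _ in range(2))  # small indexer projections
print(sparse_attention(q, k, v, iq, ik).shape)  # (64, 16)
```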
"Its computational efficiency is remarkable," the research authors said of the indexer.
Also: OpenAI's Altman calls AI sector 'bubbly', but says we shouldn't worry - here's why
The result of the lightning indexer is that their sparsity approach, which DeepSeek calls DeepSeek Sparse Attention, "requires much less computation" in their tests against V3.1, and results in "a significant end-to-end speedup in long-context scenarios."
Moreover, the authors said: "We do not observe substantial performance degradation compared with DeepSeek-V3.1-Terminus, on both short- and long-context tasks" with respect to accuracy.
Mind you, it's not only sparsity. The researchers made a couple of other tweaks as well, including training V3.2 on domain-specific task data, such as mathematics problems and coding.
The authors said that more extensive real-world testing is necessary and is underway.
Evolutionary not revolutionary
Given the hype that has surrounded DeepSeek since January, it's worth keeping in mind that the lightning indexer and DeepSeek Sparse Attention are simply the latest offerings in a long tradition of sparsity exploitation, as I pointed out in a previous article.
For many years, researchers have specifically explored ways to reduce the computational burden of the key-value calculations. There have been numerous variants of attention used to reduce query-key cost, leading researchers to develop a taxonomy.
The original attention method is referred to as "multi-head attention." Other approaches have been "multi-query attention," "grouped-query attention," and "flash attention." DeepSeek even has its own brand of attention, called "multi-head latent attention," an approach that brought benefits to V3.1 and is preserved in V3.2.
Given that there have been, and likely will continue to be, innovations to the attention mechanism from many parties, this DeepSeek innovation looks more evolutionary than revolutionary.