
DeepSeek research touts memory breakthrough, decoupling compute and memory pools to bypass GPU and HBM constraints: Engram conditional memory module commits static knowledge to system RAM


DeepSeek has released a technical paper detailing a method by which future AI models could rely on a queryable database of information committed to system memory. Named "Engram", the conditional memory technique achieves demonstrably higher performance on long-context queries by committing sequences of data to static memory. This eases a model's reliance on reasoning, letting GPUs handle only the more complex tasks, which increases performance and reduces the reliance on high-bandwidth memory (HBM).

The paper details how N-grams, statistical sequences of words, are integrated into the model's neural network so that they can be placed into a queryable memory bank. Engram lets models remember facts rather than reason them out, which is more computationally expensive.
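The paper describes the exact mechanism in detail; as a rough mental model only, the toy sketch below shows an n-gram-keyed memory bank in Python, where sequences are committed once and later recalled by lookup. The class name, vector contents, and API are hypothetical and are not DeepSeek's code.

```python
# Illustrative sketch only: a toy n-gram-keyed memory bank, not DeepSeek's Engram code.
from __future__ import annotations

import numpy as np


class NgramMemoryBank:
    """Maps fixed-length token n-grams to stored embedding vectors."""

    def __init__(self, n: int = 3, dim: int = 64, seed: int = 0):
        self.n = n
        self.dim = dim
        self.rng = np.random.default_rng(seed)
        self.bank: dict[tuple[int, ...], np.ndarray] = {}

    def commit(self, tokens: list[int]) -> None:
        """Write every n-gram in a token sequence into the memory bank."""
        for i in range(len(tokens) - self.n + 1):
            key = tuple(tokens[i : i + self.n])
            if key not in self.bank:
                # Placeholder vector; a real system would store learned embeddings.
                self.bank[key] = self.rng.standard_normal(self.dim).astype(np.float32)

    def query(self, ngram: tuple[int, ...]) -> np.ndarray | None:
        """Return the stored vector for an n-gram, or None if it was never committed."""
        return self.bank.get(ngram)


bank = NgramMemoryBank(n=3)
bank.commit([5, 17, 42, 9, 3])      # commit static knowledge once, up front
hit = bank.query((17, 42, 9))       # later queries are cheap dictionary lookups
print(hit is not None)              # True: the fact is recalled, not recomputed
```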

Released on the company's GitHub page, Engram aims to curb the reliance on more exotic memory types by committing a knowledge library to more common system memory, potentially attached over a standard such as CXL.

Reducing the reliance on HBM

The ongoing reliance on high-bandwidth memory for AI accelerators is something that even Chinese silicon, such as Huawei's Ascend series, cannot escape. Each HBM stack is built from multiple DRAM dies, and with demand skyrocketing, easing an AI model's reliance on the GPU's directly attached high-bandwidth memory would be significant, especially amid the ongoing memory supply squeeze.

Engram would allow static memory to be held separately from an LLM's compute, so the GPU's fast HBM can be dedicated to reasoning. That makes an Engram-based AI model more performant than a standard Mixture of Experts (MoE) model.
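As a minimal sketch of that decoupling idea, the snippet below assumes a large static table kept in ordinary system RAM (the kind of capacity that could sit behind a CXL expander) with only the rows a query needs copied to the GPU. The table size, function name, and row IDs are illustrative, not taken from the paper.

```python
# Hypothetical sketch: keep a large static table in host RAM and copy only the
# rows a query needs to the GPU. Not DeepSeek's implementation.
import torch

# The static knowledge table lives in ordinary system RAM,
# leaving the GPU's HBM free for the model's active compute.
static_table = torch.randn(1_000_000, 128, dtype=torch.float16)

def fetch_rows(row_ids: torch.Tensor, device: str = "cuda") -> torch.Tensor:
    """Gather the requested rows on the host, then send just those rows to the GPU."""
    rows = static_table.index_select(0, row_ids)   # cheap gather in system RAM
    return rows.to(device)                         # tiny copy over the host-to-GPU link

if torch.cuda.is_available():
    needed = torch.tensor([3, 1_024, 999_999])     # ids produced by the n-gram lookup
    on_gpu = fetch_rows(needed)                    # only 3 x 128 values cross the bus
    print(on_gpu.shape, on_gpu.device)             # torch.Size([3, 128]) cuda:0
```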

As detailed in the paper, an Engram-based model scaled to nearly 27 billion parameters can beat a standard MoE model in long-context training, and it eliminates the computational waste of having to reason out facts by allowing them to be stored externally.

A standard MoE model has to reconstruct these pieces of data every time they are referenced in a query, an approach known as conditional computation. The model calls on its expert parameters to assemble and reason over the data for every query, even though it only routes each query to certain experts, a technique known as sparse computation.
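To make the contrast concrete, here is a heavily simplified, textbook-style MoE layer in PyTorch illustrating sparse computation: a router activates only the top-k experts per token, yet any fact still has to be re-derived by expert weights on every query. This is a generic toy, not DeepSeek's architecture.

```python
# Toy illustration of sparse "conditional computation" in a Mixture of Experts layer.
# Simplified for clarity; real MoE routers are more involved.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyMoE(nn.Module):
    def __init__(self, dim: int = 32, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)   # gating network
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Sparse computation: each token only activates its top-k experts...
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for token in range(x.shape[0]):
                expert = self.experts[idx[token, slot]]
                # ...but any "fact" must still be re-derived by expert weights every
                # time it appears in a query; nothing is looked up from static memory.
                out[token] += weights[token, slot] * expert(x[token])
        return out


layer = TinyMoE()
print(layer(torch.randn(4, 32)).shape)   # torch.Size([4, 32])
```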

How Engram embeds itself into training and inference workloads (Image credit: DeepSeek)

The Engram paper adds that a conditional memory module would allow the model to simply ask "Do I already have this data?" rather than engaging the parts of the model that handle reasoning.
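That lookup-before-compute flow could look something like the sketch below; memory_bank and run_reasoning_path are illustrative stand-ins rather than Engram's actual API.

```python
# Hypothetical sketch of the "Do I already have this data?" check described above.
# memory_bank and run_reasoning_path are illustrative stand-ins, not Engram's API.

def answer(query_ngram, memory_bank, run_reasoning_path):
    """Serve a query from the static memory bank when possible; fall back to compute."""
    cached = memory_bank.get(query_ngram)        # cheap lookup in system RAM
    if cached is not None:
        return cached                            # fact recalled; no expert compute spent
    result = run_reasoning_path(query_ngram)     # expensive path: experts on GPU + HBM
    memory_bank[query_ngram] = result            # optionally commit it for next time
    return result
```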
