TL;DR: Meta Superintelligence's (MSI for short) first paper, REFRAG, is about a new way to do RAG. A slightly modified LLM consumes most retrieved document chunks as compact, LLM-aligned chunk embeddings (produced by a lightweight encoder) rather than as full token sequences. A lightweight policy, trained with RL, decides which chunk embeddings should be expanded back into full tokens under a budget, and the LLM runs normally on this mixed input. The net effect is far less KV-cache and attention cost, much lower time-to-first-token latency, and higher throughput, while preserving perplexity and task accuracy in benchmarks.

Meta's new Superintelligence Labs made big headlines with its eye-watering salaries for researchers and leaders alike. Big-name founders flocked to this new group, so when it published its first paper, we thought it was a good time to do a quick paper review. This is the paper: https://arxiv.org/abs/2509.01092

Our first thought was that the paper was on a topic we didn't expect: RAG. If you're building or investing in products that rely on RAG, you might be building things like AI agents, LLM-powered search, customer support, summarization, or vertical agents. In these cases, inference cost and latency are both massive drivers of your user experience and business model. A very intelligent model creates a better UX, but you run the risk of CAC exceeding LTV. A fast response, measured by things like time-to-first-token, is attractive, but it typically means needing a bigger inference machine. All of these impact whether your AI application is economically viable. This is precisely where MSI's first paper innovates: REFRAG promises up to 30x faster responses (specifically, time to first token) for existing RAG stacks.

Why is this surprising? We were expecting MSI to publish papers that address improvements in the "model layer", focused on foundational model performance: experiments that push us beyond scaling training datasets and using more compute for reasoning. New architectures. New modalities. RAG, by contrast, is a very real-world, practical topic for something as significant as a new lab's first paper. Specifically, RAG is "different" because enterprises and consumer apps have operational RAG pipelines with real revenue attached to them. Any improvement to cost or latency there leads to immediate ROI, and in a way where the benefits are very clear to application-layer teams building with LLMs and less visible to foundational labs like MSI.

The ROI comes from a few places. At the UX level, faster responses increase retention. Cutting time-to-first-token (TTFT) multiplies effective serving capacity. And software-level efficiency opens new headroom without buying new GPUs or re-architecting models.

Well, how does it work? In traditional RAG, you have a knowledge base, say a vector database of unstructured text "documents", and an LLM that takes a user query, searches the knowledge base for relevant documents or chunks, and generates a response for the user. The main limitation is the context window: LLMs can take only a finite amount of information, up to millions of tokens as of this writing, and every retrieved token you stuff into the prompt adds attention and KV-cache cost.

In Meta's REFRAG, documents for retrieval are still chunked (~128-token pieces). But after that, each chunk is encoded into a compact chunk embedding by a lightweight encoder and projected into the LLM's embedding space. These embeddings are precomputable and cached.
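To make that precompute step concrete, here is a minimal sketch of how it could look; this is our reading, not the paper's code. The mean-pooling `encoder`, the `proj` layer, and the dimensions (`LLM_DIM`, `ENC_DIM`, `VOCAB`) are illustrative stand-ins: in REFRAG the encoder and projection are trained (reconstruction pretraining plus SFT, as discussed under limitations below) so the decoder actually understands these vectors.

```python
# Minimal sketch of the offline precompute step (our reading, not the released code).
# Assumptions: token IDs per ~128-token chunk are available; the encoder here is a
# trivial mean-pooling stand-in for REFRAG's lightweight chunk encoder.
import torch
import torch.nn as nn

LLM_DIM = 4096      # decoder LLM hidden size (assumed)
ENC_DIM = 768       # lightweight encoder output size (assumed)
VOCAB = 32_000      # tokenizer vocabulary size (assumed)
CHUNK_TOKENS = 128  # chunk length used in the paper

encoder = nn.EmbeddingBag(VOCAB, ENC_DIM, mode="mean")  # stand-in chunk encoder
proj = nn.Linear(ENC_DIM, LLM_DIM)  # aligns encoder output with the LLM embedding space

@torch.no_grad()
def precompute_chunk_embeddings(chunks: list[list[int]]) -> torch.Tensor:
    """One LLM-aligned vector per chunk; computed offline and cached next to the index."""
    ids = torch.tensor(chunks)   # (num_chunks, CHUNK_TOKENS)
    return proj(encoder(ids))    # (num_chunks, LLM_DIM)

# Example: three dummy chunks -> three cacheable vectors
cache = precompute_chunk_embeddings([[1] * CHUNK_TOKENS for _ in range(3)])
print(cache.shape)  # torch.Size([3, 4096])
```

Because the vectors are query-independent, they can be stored alongside the chunks in the index and reused across requests.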
At serving time, the user query is embedded and the system retrieves candidate chunks. Instead of sending every chunk's full token stream to the LLM, the system feeds it a mixture of (a) projected chunk embeddings for most chunks and (b) full token sequences for the few chunks that the policy picks to expand. A small policy network looks at the chunk embeddings and, under an expansion budget, decides which chunks to expand back into tokens; it is trained with an RL objective that rewards reduced perplexity on generation, i.e. it learns to maximize downstream quality within the budget. The LLM therefore sees a short token sequence (the expanded chunks plus the query) alongside a handful of single-vector placeholders (the unexpanded chunks), and it generates text as normal. (A minimal sketch of this assembly step appears just before our predictions below.)

The paper frames the core insight as using the policy network to compress the less relevant chunks in the RAG process. To us, the core insight is actually this: if embeddings are already generated by layers within an LLM, it makes little sense to convert them back into natural language only for another LLM to compress those tokens back into embeddings. That is why the speedups come without collapsing accuracy.

Where this sits in the current AI value chain

Contrast two vectors of innovation in LLM land:

Model-level breakthroughs (new architectures, larger models, novel pretraining): high-risk, high-reward, long timelines, big capital.

Application/system-level efficiency (inference optimizations, retrieval tricks, orchestration): lower-risk, immediate ROI, directly monetizable.

MSI publishing a RAG-efficiency paper signals, in our opinion, a broader direction: go after problems with ROI today, where their research and infrastructure expertise can move the needle.

For enterprises and product teams, this is a ripe candidate for production pilots. Evaluate TTFT, throughput, and cost-per-query before and after. The upside is immediate: more queries per GPU, lower infra spend, and better UX. You can also mix and match stacks: REFRAG is orthogonal to better retrievers or rerankers, so you can combine it with a stronger reranker to shrink the candidate set even further.

In the broader market, it's an interesting paper at an even more interesting time in the vector DB space. Leading vector DB Pinecone is rumoured to be exploring a sale, and there was a founder-to-operator CEO transition. Fresh research from DeepMind, "On the Theoretical Limitations of Embedding-Based Retrieval," highlights how some documents are always out of reach for embedding-based RAG; Deedy Das of Menlo Ventures called it proof that "plain old BM25 from 1994 outperforms vector search on recall."

While we don't have a working implementation of REFRAG yet, we can see, or guess at, some limitations:

Training & engineering complexity. You must add an encoder and projection and train them so the LLM understands the embeddings (reconstruction pretraining plus SFT). The selective policy is an RL problem: stable to train, but it adds development complexity.

Compression ceiling. Aggressive compression eventually degrades downstream quality. There's a tradeoff between how small your embeddings are and how often you must expand.

Freshness. Precomputed chunk embeddings are great for static corpora. For frequently changing data, you need pipelines to recompute embeddings or you have to rely on hybrid strategies.

Use cases. Chunk embeddings are coarse, lossy summaries; precision-critical tasks (legal reasoning, exact quoting, sensitive medical facts) need careful evaluation, and you may want lower compression budgets there.
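Putting the serving path together, here is a minimal, hypothetical sketch of the input assembly described earlier. `policy_scorer` stands in for the RL-trained policy (here just an untrained linear head), `token_embed` stands in for the LLM's own token-embedding table, and the dimensions and budget are made up; the resulting mixed sequence would be fed to the decoder through its input-embeddings interface rather than as token IDs.

```python
# Hypothetical sketch of REFRAG-style input assembly (our reading, not the paper's code).
import torch
import torch.nn as nn

LLM_DIM = 4096                                # decoder hidden size (assumed)
policy_scorer = nn.Linear(LLM_DIM, 1)         # stand-in for the RL-trained selection policy
token_embed = nn.Embedding(32_000, LLM_DIM)   # stand-in for the LLM's token-embedding table

def build_llm_inputs(chunk_vecs, chunk_token_ids, query_ids, expand_budget=2):
    """Expand the top-`expand_budget` chunks back to tokens; keep the rest as one vector each."""
    scores = policy_scorer(chunk_vecs).squeeze(-1)                   # (num_chunks,)
    expand = set(torch.topk(scores, expand_budget).indices.tolist())

    rows = []
    for i, vec in enumerate(chunk_vecs):
        if i in expand:
            rows.append(token_embed(torch.tensor(chunk_token_ids[i])))  # (128, LLM_DIM)
        else:
            rows.append(vec.unsqueeze(0))                                # (1, LLM_DIM) placeholder
    rows.append(token_embed(torch.tensor(query_ids)))                    # user query as tokens
    return torch.cat(rows, dim=0)  # mixed sequence the decoder consumes as input embeddings

# Example: 8 retrieved chunks, budget of 2 -> 2*128 expanded + 6 placeholders + 3 query tokens
mixed = build_llm_inputs(torch.randn(8, LLM_DIM),
                         [[1] * 128 for _ in range(8)],
                         query_ids=[5, 6, 7])
print(mixed.shape)  # torch.Size([265, 4096])
```

The point is the shape of the input: with 8 retrieved chunks and a budget of 2, the decoder attends over roughly 265 positions instead of over a thousand, which is where the KV-cache, attention, and TTFT savings come from.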
Our predictions

A few closing thoughts. This paper seems to say, "Why optimise token costs when you can use a completely different kind of token?" If LLMs can be embedding-native on the READ side, can they also be embedding-native on the WRITE side, thus accelerating agents 30x overall? Cost per token for an embedding model is almost zero; have we just saved a ton on token prices by moving to a different architecture? What's the catch?

REFRAG is a reminder that not all breakthroughs come from bigger models. Making RAG cheaper and faster at scale is a direct lever on product economics, and the industry will reward teams that operationalize these wins.