TL;DR
Meta Superintelligence's (MSI) first paper, REFRAG, proposes a new way to do RAG (retrieval-augmented generation).
Instead of feeding the LLM every retrieved token, a slightly modified model consumes most retrieved document chunks as compact, LLM-aligned chunk embeddings that the decoder can ingest directly.
A lightweight policy (trained with RL) decides which chunk embeddings should be expanded back into full tokens under a fixed budget; the LLM then runs normally on this mixed input.
The net effect is a much smaller KV cache and lower attention cost, much faster time-to-first-token, and higher throughput, while preserving perplexity and task accuracy in benchmarks (a rough code sketch of this mixed input follows).
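To make the TL;DR concrete, here is a minimal PyTorch sketch of how such a mixed input could be assembled. Everything in it is our illustrative assumption rather than code from the paper: `ChunkCompressor`, `build_mixed_input`, the dimensions, and the use of a simple top-k over scores as a stand-in for the RL policy. The real system trains a proper chunk encoder, projection, and selection policy.

```python
# Minimal sketch of a REFRAG-style mixed input, assuming a decoder that can
# accept precomputed embeddings (as Hugging Face models do via `inputs_embeds`).
# All names, dimensions, and the pooling "encoder" below are illustrative.

import torch
import torch.nn as nn

CHUNK_TOKENS  = 16    # tokens per retrieved chunk
ENC_DIM       = 256   # hidden size of the lightweight chunk encoder
DEC_DIM       = 512   # decoder embedding size (kept small for the sketch)
VOCAB         = 32_000
EXPAND_BUDGET = 2     # how many chunks the policy may expand back into tokens

class ChunkCompressor(nn.Module):
    """Compress each retrieved chunk into one decoder-aligned embedding."""
    def __init__(self):
        super().__init__()
        # Stand-in for a small encoder: mean-pool token embeddings so the
        # sketch stays self-contained, then project into the decoder's space.
        self.token_emb = nn.Embedding(VOCAB, ENC_DIM)
        self.project = nn.Linear(ENC_DIM, DEC_DIM)

    def forward(self, chunk_token_ids: torch.Tensor) -> torch.Tensor:
        # chunk_token_ids: (num_chunks, CHUNK_TOKENS)
        pooled = self.token_emb(chunk_token_ids).mean(dim=1)  # (num_chunks, ENC_DIM)
        return self.project(pooled)                           # (num_chunks, DEC_DIM)

def build_mixed_input(question_emb, chunk_token_ids, compressor,
                      decoder_token_emb, policy_scores):
    """Build one sequence of decoder-space embeddings: compressed chunk
    embeddings, except for the budgeted chunks the policy expands back into
    their full per-token embeddings, followed by the question tokens."""
    chunk_embs = compressor(chunk_token_ids)                  # (num_chunks, DEC_DIM)
    expand_ids = torch.topk(policy_scores, k=EXPAND_BUDGET).indices.tolist()

    pieces = []
    for i in range(chunk_token_ids.size(0)):
        if i in expand_ids:
            # Expanded chunk: CHUNK_TOKENS full token embeddings.
            pieces.append(decoder_token_emb(chunk_token_ids[i]))
        else:
            # Compressed chunk: one embedding stands in for all its tokens.
            pieces.append(chunk_embs[i].unsqueeze(0))
    pieces.append(question_emb)                               # question stays as tokens
    return torch.cat(pieces, dim=0)   # pass to the decoder via `inputs_embeds`

# Toy usage: 8 retrieved chunks, a 12-token question.
compressor = ChunkCompressor()
decoder_token_emb = nn.Embedding(VOCAB, DEC_DIM)              # stand-in for the LLM's embedding table
chunks = torch.randint(0, VOCAB, (8, CHUNK_TOKENS))
question = decoder_token_emb(torch.randint(0, VOCAB, (12,)))
scores = torch.rand(8)                                        # stand-in for RL policy scores
mixed = build_mixed_input(question, chunks, compressor, decoder_token_emb, scores)
print(mixed.shape)  # (2*16 + 6*1 + 12, DEC_DIM) = (50, 512)
```

The payoff shows up in the final sequence length: six of the eight chunks contribute one vector each instead of sixteen tokens, and that shorter sequence is where the KV-cache, attention, and time-to-first-token savings come from.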
Meta's new Superintelligence Labs made big headlines with its eye-watering salaries for researchers and leaders alike. Big-name founders flocked to this new group, so when it published its first paper, we thought it was a good time to do a quick paper review.
This is the paper: https://arxiv.org/abs/2509.01092
Our first thought was that the paper covers a topic we didn't expect: RAG. If you're building or investing in products that rely on RAG, that likely means things like AI agents, LLM-powered search, customer support, summarization, or vertical agents.
In these cases, inference cost and latency are both massive drivers of your user experience and business model. A more capable model creates a better UX, but it costs more to serve, so you run the risk of ending up with CAC > LTV. A fast response, measured by metrics like time-to-first-token, is attractive to users, but it typically means paying for a bigger inference machine. All of these factors determine whether your AI application is economically viable.
This is precisely where Meta Superintelligence’s (MSI for short) first paper innovates. REFRAG promises 30x faster responses (specifically, time to first token) for existing RAG stacks.