I’ve been working in AI and search for a decade: first building Doctrine, the largest European legal search engine, and now Fintool, an AI-powered financial research platform that helps institutional investors analyze companies, screen stocks, and make investment decisions.
After three years of building, optimizing, and scaling LLMs with retrieval-augmented generation (RAG) systems, I believe we’re witnessing the twilight of RAG-based architectures. As context windows explode and agent-based architectures mature, my controversial opinion is that the current RAG infrastructure we spent so much time building and optimizing is on the decline.
The Rise of Retrieval-Augmented Generation
In late 2022, ChatGPT took the world by storm. People started endless conversations and delegated crucial work, only to realize that the underlying model, GPT-3.5, could only handle 4,096 tokens... roughly six pages of text!
The AI world faced a fundamental problem: how do you make an intelligent system work with knowledge bases that are orders of magnitude larger than what it can read at once?
The answer became Retrieval-Augmented Generation (RAG), an architectural pattern that would dominate AI for the next three years.
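To make the pattern concrete, here is a minimal, illustrative sketch of the RAG loop: chunk the corpus, rank chunks against the query, and stuff only the top results into the model’s limited context window. This is a toy example, not Fintool’s implementation; the hashed bag-of-words embedding stands in for a real embedding model and vector database so the snippet runs on its own.

```python
# Toy RAG sketch: retrieve the most relevant chunks, then build a prompt
# that fits inside a small context window. The embedding is a hashed
# bag-of-words stand-in; production systems use learned embeddings and a
# vector store.
import math
from collections import Counter

def embed(text: str, dims: int = 256) -> list[float]:
    """Toy embedding: hash each token into a fixed-size, L2-normalized vector."""
    vec = [0.0] * dims
    for token, count in Counter(text.lower().split()).items():
        vec[hash(token) % dims] += count
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def retrieve(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Rank pre-chunked documents by similarity to the query and keep the top k."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query: str, chunks: list[str]) -> str:
    """Stuff only the retrieved chunks into the prompt sent to the LLM."""
    context = "\n---\n".join(retrieve(query, chunks))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
```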
The Mathematical Reality of Early LLMs
GPT-3.5 could handle 4,096 tokens, and the next model, GPT-4, doubled it to 8,192 tokens, about twelve pages. This wasn’t just inconvenient; it was architecturally devastating.
Consider the numbers: A single SEC 10-K filing contains approximately 51,000 tokens (130+ pages).
With 8,192 tokens, you could see only about 16% of a 10-K filing. It’s like reading a financial report through a keyhole!
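A quick back-of-the-envelope check of that coverage figure, using the numbers from the text (8,192-token window, roughly 51,000 tokens per 10-K):

```python
# Coverage of a single SEC 10-K filing by GPT-4's original context window.
context_window = 8_192   # tokens
filing_tokens = 51_000   # approximate tokens in a 10-K (130+ pages)

coverage = context_window / filing_tokens
print(f"{coverage:.1%} of the filing fits in one prompt")  # ~16.1%
```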