I’ve been working in AI and search for a decade: first building Doctrine, the largest European legal search engine, and now Fintool, an AI-powered financial research platform that helps institutional investors analyze companies, screen stocks, and make investment decisions.
After three years of building, optimizing, and scaling LLMs with retrieval-augmented generation (RAG) systems, I believe we’re witnessing the twilight of RAG-based architectures. As context windows explode and agent-based architectures mature, my controversial opinion is that the current RAG infrastructure we spent so much time building and optimizing is on the decline.
The Rise of Retrieval-Augmented Generation
In late 2022, ChatGPT took the world by storm. People started endless conversations and delegated crucial work, only to realize that the underlying model, GPT-3.5, could only handle 4,096 tokens... roughly six pages of text!
The AI world faced a fundamental problem: how do you make an intelligent system work with knowledge bases that are orders of magnitude larger than what it can read at once?
The answer became Retrieval-Augmented Generation (RAG), an architectural pattern that would dominate AI for the next three years.
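To make the pattern concrete, here is a minimal, illustrative sketch of the RAG loop: chunk the corpus, rank chunks against the query, and stuff only the top results into the model’s limited context window. This is a toy example, not Fintool’s implementation; the hashed bag-of-words embedding stands in for a real embedding model and vector database so the snippet runs on its own.

```python
# Toy RAG sketch: retrieve the most relevant chunks, then build a prompt
# that fits inside a small context window. The embedding is a hashed
# bag-of-words stand-in; production systems use learned embeddings and a
# vector store.
import math
from collections import Counter

def embed(text: str, dims: int = 256) -> list[float]:
    """Toy embedding: hash each token into a fixed-size, L2-normalized vector."""
    vec = [0.0] * dims
    for token, count in Counter(text.lower().split()).items():
        vec[hash(token) % dims] += count
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def retrieve(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Rank pre-chunked documents by similarity to the query and keep the top k."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query: str, chunks: list[str]) -> str:
    """Stuff only the retrieved chunks into the prompt sent to the LLM."""
    context = "\n---\n".join(retrieve(query, chunks))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
```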
The Mathematical Reality of Early LLMs
GPT-3.5 could handle 4,096 tokens, and the next model, GPT-4, doubled it to 8,192 tokens, about twelve pages. This wasn’t just inconvenient; it was architecturally devastating.
Consider the numbers: A single SEC 10-K filing contains approximately 51,000 tokens (130+ pages).
With 8,192 tokens, you could see only about 16% of a 10-K filing. It’s like reading a financial report through a keyhole!
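A quick back-of-the-envelope check of that coverage figure, using the numbers from the text (8,192-token window, roughly 51,000 tokens per 10-K):

```python
# Coverage of a single SEC 10-K filing by GPT-4's original context window.
context_window = 8_192   # tokens
filing_tokens = 51_000   # approximate tokens in a 10-K (130+ pages)

coverage = context_window / filing_tokens
print(f"{coverage:.1%} of the filing fits in one prompt")  # ~16.1%
```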