For all their superhuman power, today’s AI models suffer from a surprisingly human flaw: They forget. Give an AI assistant a sprawling conversation, a multi-step reasoning task or a project spanning days, and it will eventually lose the thread. Engineers refer to this phenomenon as “context rot,” and it has quietly become one of the most significant obstacles to building AI agents that can function reliably in the real world.

A research team from China and Hong Kong believes it has created a solution to context rot. Their new paper introduces general agentic memory (GAM), a system built to preserve long-horizon information without overwhelming the model. The core premise is simple: Split memory into two specialized roles, one that captures everything and another that retrieves exactly the right things at the right moment.

Early results are encouraging, and the timing could hardly be better. As the industry moves beyond prompt engineering and embraces the broader discipline of context engineering, GAM is emerging at precisely the right inflection point.

When bigger context windows still aren’t enough

At the heart of every large language model (LLM) lies a rigid limitation: A fixed “working memory,” more commonly referred to as the context window. Once conversations grow long, older information gets truncated, summarized or silently dropped. AI researchers have long recognized this limitation, and since early 2023, developers have been working to expand context windows, rapidly increasing the amount of information a model can handle in a single pass.

Mistral’s Mixtral 8x7B debuted with a 32K-token window, roughly 24,000 words or about 128,000 characters of English text, the length of a short book. MosaicML’s MPT-7B-StoryWriter-65k+ more than doubled that capacity. Then came Google’s Gemini 1.5 Pro and Anthropic’s Claude 3, offering massive 128K and 200K windows, both of which are extendable to an unprecedented one million tokens. Even Microsoft joined the push, vaulting from the 2K-token limit of the earlier Phi models to the 128K context window of Phi-3.

Increasing context windows might sound like the obvious fix, but it isn’t. Even models with sprawling 100K-token windows, enough to hold hundreds of pages of text, still struggle to recall details buried near the beginning of a long conversation. Scaling context comes with its own set of problems. As prompts grow longer, models become less reliable at locating and interpreting information, because attention over distant tokens weakens and accuracy gradually erodes.

Longer inputs also dilute the signal-to-noise ratio: Including every possible detail can actually make responses worse than using a focused prompt. Long prompts also slow models down; more input tokens lead to noticeably higher output latency, creating a practical limit on how much context can be used before performance suffers.

Memories are priceless

For most organizations, supersized context windows come with a clear downside: They’re costly. Sending massive prompts through an API is never cheap, and because pricing scales directly with input tokens, even a single bloated request can drive up expenses. Prompt caching helps, but not enough to offset the habit of routinely overloading models with unnecessary context.
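To make that economics argument concrete, here is a minimal back-of-the-envelope sketch (not taken from the paper) showing how per-request cost grows linearly with the number of input tokens. The per-token prices are hypothetical placeholders; substitute your provider’s actual rates.

```python
# Back-of-the-envelope illustration (not from the GAM paper): per-request cost
# scales linearly with input tokens. Prices below are hypothetical placeholders.

PRICE_PER_1M_INPUT_TOKENS = 3.00    # hypothetical $ per million input tokens
PRICE_PER_1M_OUTPUT_TOKENS = 15.00  # hypothetical $ per million output tokens

def request_cost(input_tokens: int, output_tokens: int = 1_000) -> float:
    """Estimate the dollar cost of a single API call."""
    return (input_tokens / 1_000_000) * PRICE_PER_1M_INPUT_TOKENS + \
           (output_tokens / 1_000_000) * PRICE_PER_1M_OUTPUT_TOKENS

for context in (8_000, 32_000, 128_000, 1_000_000):
    print(f"{context:>9,} input tokens -> ${request_cost(context):.3f} per request")
```

Even at modest rates like these, the input cost alone grows by a factor of 125 between an 8K-token and a one-million-token prompt, which is why stuffing the window is rarely a sustainable memory strategy.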
And that’s the tension at the heart of the issue: Memory is essential to making AI more powerful, yet as context windows stretch into the hundreds of thousands or millions of tokens, the financial overhead rises just as sharply. Scaling context is both a technical challenge and an economic one, and relying on ever-larger windows quickly becomes an unsustainable strategy for long-term memory.

Fixes like summarization and retrieval-augmented generation (RAG) aren’t silver bullets either. Summaries inevitably strip away subtle but important details, and traditional RAG, while strong on static documents, tends to break down when information stretches across multiple sessions or evolves over time. Even newer variants, such as agentic RAG and RAG 2.0 (which do a better job of steering the retrieval process), still inherit the same foundational flaw: They treat retrieval as the solution rather than treating memory itself as the core problem.

Compilers solved this problem decades ago

If memory is the real bottleneck, and retrieval can’t fix it, then the gap needs a different kind of solution. That’s the bet behind GAM. Instead of pretending retrieval is memory, GAM keeps a full, lossless record and layers smart, on-demand recall on top of it, resurfacing the exact details an agent needs even as conversations twist and evolve.

A useful way to understand GAM is through a familiar idea from software engineering: Just-in-time (JIT) compilation. Rather than precomputing a rigid, heavily compressed memory, GAM keeps things light by storing a minimal set of cues alongside a full, untouched archive of the raw history. Then, when a request arrives, it “compiles” a tailored context on the fly.

This JIT approach is built into GAM’s dual architecture, allowing an AI agent to carry context across long conversations without overcompressing or guessing too early about what matters. The result is the right information, delivered at exactly the right moment.

Inside GAM: A two-agent system built for memory that endures

GAM revolves around a simple idea: Separating the act of remembering from the act of recalling. Accordingly, it has two components: The 'memorizer' and the 'researcher.'

The memorizer: Total recall without overload

The memorizer captures every exchange in full, quietly turning each interaction into a concise memo while preserving the complete session in a searchable page store. It doesn’t compress aggressively or guess what is important. Instead, it organizes interactions into structured pages, adds metadata for efficient retrieval and generates optional lightweight summaries for quick scanning. Critically, every detail is preserved; nothing is thrown away.

The researcher: A deep retrieval engine

When the agent needs to act, the researcher takes the helm. It plans a search strategy, combining embeddings with keyword methods like BM25, navigating through page IDs and stitching the pieces together. It conducts layered searches across the page store, blending vector retrieval, keyword matching and direct lookups. It evaluates its findings, identifies gaps and continues searching until it has sufficient evidence to produce a confident answer, much like a human analyst reviewing old notes and primary documents. It iterates, searches, integrates and reflects until it has built a clean, task-specific briefing.

GAM’s power comes from this JIT memory pipeline, which assembles rich, task-specific context on demand instead of leaning on brittle, precomputed summaries.
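The paper describes the architecture rather than a reference implementation, but a minimal sketch of the memorizer/researcher split might look like the following. All class and method names are illustrative, and the researcher’s hybrid embedding-plus-BM25 retrieval is reduced to simple keyword overlap for brevity.

```python
import re
from dataclasses import dataclass, field

def tokenize(text: str) -> set[str]:
    """Crude keyword extraction; a real system would use embeddings and BM25."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

@dataclass
class Page:
    """A lossless record of one interaction plus lightweight retrieval cues."""
    page_id: int
    raw_text: str            # full, untouched history: nothing is thrown away
    memo: str                # concise cue (stand-in for an LLM-written memo)
    keywords: set[str] = field(default_factory=set)

class Memorizer:
    """Captures every exchange in full and files it as a structured page."""
    def __init__(self) -> None:
        self.pages: list[Page] = []

    def record(self, raw_text: str) -> Page:
        page = Page(
            page_id=len(self.pages),
            raw_text=raw_text,
            memo=raw_text[:120],
            keywords=tokenize(raw_text),
        )
        self.pages.append(page)
        return page

class Researcher:
    """Compiles a task-specific context on demand ('just in time')."""
    def __init__(self, store: Memorizer) -> None:
        self.store = store

    def compile_context(self, query: str, budget: int = 2, max_rounds: int = 2) -> str:
        terms = tokenize(query)
        selected: list[Page] = []
        for _ in range(max_rounds):              # iterate: search, reflect, search again
            candidates = sorted(
                (p for p in self.store.pages if p not in selected),
                key=lambda p: len(terms & p.keywords),
                reverse=True,
            )
            selected.extend(candidates[: budget - len(selected)])
            if len(selected) >= budget:
                break
            for p in selected:                   # crude query expansion between rounds
                terms |= p.keywords
        # Hand back full raw pages, not summaries, so no detail is lost.
        return "\n---\n".join(p.raw_text for p in selected)

# Usage: remember everything, then recall only what the current task needs.
memory = Memorizer()
memory.record("Day 1: The client asked us to migrate the billing service to Postgres.")
memory.record("Day 3: Budget approved; the migration deadline moved to March 15.")
memory.record("Day 7: Unrelated chat about lunch options near the office.")

researcher = Researcher(memory)
print(researcher.compile_context("When is the billing migration deadline?"))
```

The design choice mirrored in this sketch is that compression happens only at read time: The memorizer never discards raw history, and the researcher assembles a fresh, query-specific context for every request.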
Its core innovation is simple yet powerful: It preserves all information intact and makes every detail recoverable. Ablation studies support this approach: Traditional memory fails on its own, and naive retrieval isn’t enough. It’s the pairing of a complete archive with an active, iterative research engine that enables GAM to surface details other systems leave behind.

Outperforming RAG and long-context models

To test GAM, the researchers pitted it against standard RAG pipelines and models with enlarged context windows, such as GPT-4o-mini and Qwen2.5-14B. They evaluated GAM on four major long-context and memory-intensive benchmarks, each chosen to test a different aspect of the system’s capabilities:

- LoCoMo measures an agent’s ability to maintain and recall information across long, multi-session conversations, encompassing single-hop, multi-hop, temporal-reasoning and open-domain tasks.
- HotpotQA, a widely used multi-hop QA benchmark built from Wikipedia, was adapted using MemAgent’s memory-stress-test version, which mixes relevant documents with distractors to create contexts of 56K, 224K and 448K tokens, making it ideal for testing how well GAM handles noisy, sprawling input.
- RULER evaluates retrieval accuracy, multi-hop state tracking, aggregation over long sequences and QA performance under a 128K-token context, further probing long-horizon reasoning.
- NarrativeQA requires each question to be answered from the full text of a book or movie script; the researchers sampled 300 examples with an average context size of 87K tokens.

Together, these benchmarks allowed the team to assess both GAM’s ability to preserve detailed historical information and its effectiveness in supporting complex downstream reasoning tasks.

GAM came out ahead across all benchmarks. Its biggest win was on RULER, which benchmarks long-range state tracking. Notably:

- GAM exceeded 90% accuracy.
- RAG collapsed because key details were lost in summaries.
- Long-context models faltered as older information effectively “faded,” even when technically present.

Clearly, bigger context windows aren’t the answer. GAM works because it retrieves with precision rather than piling up tokens.

GAM, context engineering and competing approaches

Poorly structured context, not model limitations, is often the real reason AI agents fail. GAM addresses this by ensuring that nothing is permanently lost and that the right information can always be retrieved, even far downstream. The technique’s emergence coincides with a broader shift in AI toward context engineering: The practice of shaping everything an AI model sees, including its instructions, history, retrieved documents, tools, preferences and output formats.

Context engineering has rapidly eclipsed prompt engineering in importance, and other research groups are tackling the memory problem from different angles. Anthropic is exploring curated, evolving context states. DeepSeek is experimenting with storing memory as images. Another group of Chinese researchers has proposed “semantic operating systems” built around lifelong adaptive memory.

GAM’s philosophy, however, is distinct: Avoid loss and retrieve with intelligence. Instead of guessing what will matter later, it keeps everything and uses a dedicated research engine to find the relevant pieces at runtime.
For agents handling multi-day projects, ongoing workflows or long-term relationships, that reliability may prove essential.

Why GAM matters for the long haul

Just as adding more compute doesn’t automatically produce better algorithms, expanding context windows alone won’t solve AI’s long-term memory problems. Meaningful progress requires rethinking the underlying system, and GAM takes that approach. Instead of depending on ever-larger models, massive context windows or endlessly refined prompts, it treats memory as an engineering challenge, one that benefits from structure rather than brute force.

As AI agents transition from clever demos to mission-critical tools, their ability to remember long histories becomes crucial for developing dependable, intelligent systems. Enterprises require AI agents that can track evolving tasks, maintain continuity and recall past interactions with precision and accuracy. GAM offers a practical path toward that future, signaling what may be the next major frontier in AI: Not bigger models, but smarter memory systems and the context architectures that make them possible.