Using search systems in conjunction with a large language model (LLM) is a common paradigm for enabling language models to access data beyond their training corpus. This approach, broadly known as retrieval-augmented generation (RAG), has traditionally relied on single-stage retrieval pipelines composed of vector search, lexical search, or regular expression matching, optionally followed by a learned reranker. While effective for straightforward lookup queries, these pipelines are fundamentally limited: they assume that the information needed to answer a question can be retrieved in a single pass.
In practice, many real-world queries cannot be satisfied in a single stage. Answering a question often requires a chain of intermediate searches in which the output of one search informs the next, a process known as multi-hop retrieval.
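To make the idea concrete, here is a minimal sketch of a two-hop query, where the result of the first search is needed to even formulate the second. The corpus and `search` function are illustrative stand-ins, not part of any real pipeline:

```python
# Toy corpus mapping queries to answers; a stand-in for a real search index.
CORPUS = {
    "Who directed Inception?": "Christopher Nolan",
    "When was Christopher Nolan born?": "1970",
}

def search(query: str) -> str:
    """Single-pass lookup: returns the stored answer for an exact query."""
    return CORPUS.get(query, "")

def multi_hop(question_template: str, first_query: str) -> str:
    """Two-hop retrieval: the first result fills a slot in the second query."""
    intermediate = search(first_query)            # hop 1: "Christopher Nolan"
    follow_up = question_template.format(intermediate)
    return search(follow_up)                      # hop 2: "1970"

birth_year = multi_hop("When was {} born?", "Who directed Inception?")
```

A single-stage retriever given only "When was the director of Inception born?" has no document that matches directly; the intermediate hop is what bridges the gap.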
To address this, multi-turn agentic search with LLMs has emerged as a viable approach to answering multi-hop retrieval queries. Rather than issuing a single query, an LLM agent iteratively decomposes a high-level question into subqueries, retrieves evidence, and refines its search strategy across multiple turns. Concurrently, it has been shown that smaller language models, trained on moderate-scale corpora, can serve as effective search agents with performance comparable to substantially larger models. Running frontier-scale models for multi-turn search incurs high cost and latency, which motivates offloading this task to a smaller, purpose-trained model.
A key factor driving the cost and latency of agentic search is the growth of the context window. As the agent gathers information over multiple turns, its context window fills rapidly with retrieved documents, many of which may be tangential or redundant. This bloated context not only increases computational cost but can also degrade downstream performance by introducing distracting information. One promising direction to address this is self-editing context, in which the agent actively decides which retrieved information to retain and which to discard, allowing it to continue long-horizon search tasks more efficiently and more accurately within a bounded context window.
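One simple way to picture self-editing context is as budgeted pruning: when the running context exceeds a token budget, the lowest-relevance passages are dropped first. The relevance scores and the whitespace token count below are illustrative assumptions, not the model's actual mechanism:

```python
def token_count(text: str) -> int:
    # Crude proxy: whitespace tokenization stands in for a real tokenizer.
    return len(text.split())

def prune_context(passages: list[dict], budget: int) -> list[dict]:
    """Keep the highest-relevance passages that fit within `budget` tokens."""
    kept, used = [], 0
    for p in sorted(passages, key=lambda p: p["relevance"], reverse=True):
        cost = token_count(p["text"])
        if used + cost <= budget:
            kept.append(p)
            used += cost
    return kept

passages = [
    {"text": "core evidence for the answer", "relevance": 0.9},
    {"text": "tangential background detail that mostly adds noise", "relevance": 0.2},
    {"text": "redundant restatement of the core evidence", "relevance": 0.4},
]
pruned = prune_context(passages, budget=10)  # only the core evidence fits
```

In a trained agent, the decision of what to discard is learned rather than scored by a fixed heuristic, but the effect is the same: the context stays bounded while search continues.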
Building on these insights, we trained Chroma Context-1, a 20B-parameter agentic search model, on over eight thousand synthetically generated tasks. Context-1 achieves retrieval performance comparable to frontier LLMs at a fraction of the cost and up to 10x the inference speed. Context-1 operates as a retrieval subagent: rather than answering questions directly, it returns a ranked set of supporting documents to a downstream answering model, cleanly separating search from generation. The model is trained to decompose a high-level query into subqueries and iteratively search a corpus across multiple turns. As the agent's context window fills, it selectively discards irrelevant results to free capacity and reduce noise for further exploration.
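The subagent arrangement can be sketched as a two-function interface: a search component that returns ranked documents, and a separate answering component that consumes them. Both functions below are hypothetical placeholders (term-overlap ranking standing in for the trained agent, a format string standing in for the answering LLM), shown only to illustrate the separation of search from generation:

```python
def retrieval_subagent(query: str, corpus: list[str]) -> list[str]:
    """Rank documents by naive term overlap with the query; a placeholder
    for the trained multi-turn search agent."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(doc.lower().split())), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored if score > 0]

def answering_model(query: str, documents: list[str]) -> str:
    """Placeholder for the downstream LLM that generates the final answer."""
    return f"Answer to {query!r} grounded in {len(documents)} document(s)."

corpus = [
    "context rot degrades long context performance",
    "unrelated note about vector databases",
]
docs = retrieval_subagent("what is context rot", corpus)
answer = answering_model("what is context rot", docs)
```

Because the subagent never generates the final answer, the answering model can be swapped independently of the retrieval stack.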
In this work we present our synthetic data generation pipeline, agent harness, and training methodology alongside a comprehensive evaluation of Context-1 across a range of retrieval benchmarks. Our results demonstrate that a purpose-trained 20B model can reach the Pareto frontier of retrieval performance with respect to cost and latency, matching or exceeding frontier models that are orders of magnitude larger at a fraction of the compute.
Key Techniques
We present the following:
- A staged training curriculum that first optimizes for recall before shifting toward precision, training the agent to progressively narrow from broad retrieval to selective retention. We release the weights of this model to the public under a permissive Apache 2.0 license.
- A context management strategy in which the agent selectively edits its own context during search, discarding irrelevant passages to free context capacity for further exploration and to reduce the effects of context rot.
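One simple way to express a recall-then-precision curriculum is an F-beta objective whose beta decays across training stages: beta > 1 rewards broad retrieval early, and beta < 1 rewards selective retention later. The schedule and scores below are illustrative assumptions, not the released training recipe:

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Weighted harmonic mean; beta > 1 favors recall, beta < 1 precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def stage_beta(stage: int) -> float:
    """Hypothetical schedule: recall-weighted early, precision-weighted late."""
    betas = [2.0, 1.0, 0.5]
    return betas[min(stage, len(betas) - 1)]

# The same two retrieval behaviors score differently as the curriculum advances:
broad = (0.3, 0.9)   # low precision, high recall (grab everything plausible)
tight = (0.9, 0.5)   # high precision, moderate recall (keep only what matters)
early = f_beta(*broad, stage_beta(0)) > f_beta(*tight, stage_beta(0))
late = f_beta(*tight, stage_beta(2)) > f_beta(*broad, stage_beta(2))
```

Under this schedule the broad behavior is preferred at stage 0 and the tight behavior at stage 2, which mirrors the narrowing from broad retrieval to selective retention described above.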