Synthesizing scientific literature with retrieval-augmented language models

OpenScholar

OpenScholar (detailed in Extended Data Fig. 1) is a new retrieval-augmented LM designed to ensure reliable, high-quality responses to a range of information-seeking queries about scientific literature.

Task formulation and challenges

Given a scientific query x, the task is to identify relevant papers, synthesize their findings and generate a response y that effectively addresses the query. This response should be accompanied by a set of citations, \({\bf{C}}=\{{c}_{1},{c}_{2},\ldots ,{c}_{K}\}\), in which each citation \({c}_{i}\) corresponds to an existing scientific paper. Each \({c}_{i}\) in \({\bf{C}}\) corresponds to specific passages from scientific literature and should be provided as an in-line citation, linked to the relevant spans of text in y, following standard practice in scientific writing. These citations allow researchers to trace the output back to the original literature, ensuring transparency and verifiability.
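
To make this formulation concrete, the following Python sketch shows one possible way to represent a response together with its span-linked citations; the class and field names are illustrative assumptions rather than structures defined by OpenScholar.

```python
# Hypothetical data structures for the task's output (illustrative only):
# a response y accompanied by citations C, each tied to a source passage
# and to the span of generated text it supports.
from dataclasses import dataclass


@dataclass
class Citation:
    paper_id: str            # identifier of the cited scientific paper
    passage: str             # the specific passage the claim is grounded in
    span: tuple[int, int]    # character offsets of the supported span in the response text


@dataclass
class ScholarResponse:
    query: str                 # the scientific query x
    text: str                  # the generated response y
    citations: list[Citation]  # the citation set C = {c_1, ..., c_K}
```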

However, this task presents several challenges: (1) retrieving high-recall, high-precision scientific content from a vast, domain-specific corpus; (2) synthesizing accurate, non-hallucinated responses grounded in the retrieved evidence; and (3) producing citation-aware outputs that align generated text with appropriate references at a fine-grained level. A further challenge lies in the scarcity of resources: to our knowledge, there is limited availability of large-scale, up-to-date scientific corpora, especially those suitable for dense retrieval, as well as a lack of supervised training data for both retrieval and generation in scientific domains.

Overview of OpenScholar

To address these challenges, OpenScholar introduces several key innovations that extend the standard retrieval-augmented generation (RAG; refs. 1,5) model for scientific literature synthesis. Specifically, OpenScholar combines domain-specialized retrieval, citation-aware generation and a new self-feedback inference mechanism, all built on top of a fully open and large-scale scientific data store.

Formally, OpenScholar consists of three key components: a data store \({\bf{D}}\), a retriever \({R}\) and a generator LM \({G}\). In standard retrieval-augmented inference pipelines, the process begins with \({R}\), which retrieves a set of passages \({\bf{P}}=\{{p}_{1},{p}_{2},\ldots ,{p}_{N}\}\) from \({\bf{D}}\), a large-scale corpus of previously published scientific papers, based on semantic relevance to the input query x. These passages serve as context for the next step. The generator LM \({G}\) then takes both the retrieved passages \({\bf{P}}\) and the input query x to produce the output y along with the corresponding citations \({\bf{C}}\). This process can be represented as:

$$y,{\bf{C}}={G}(x,{R}(x,{\bf{D}})),$$

in which each \({c}_{i}\) in \({\bf{C}}\) corresponds to a specific passage from \({\bf{P}}\).
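
As an illustration, the Python sketch below wires a retriever and a generator together in the shape of this equation. The Retriever and Generator interfaces and the retrieval_augmented_answer function are hypothetical stand-ins written under these assumptions, not OpenScholar's actual implementation.

```python
# Minimal sketch of the standard retrieval-augmented inference step
# y, C = G(x, R(x, D)), using hypothetical interfaces (illustrative only).
from typing import Protocol


class Retriever(Protocol):
    def retrieve(self, query: str, datastore: list[str], top_n: int) -> list[str]:
        """Return the top-N passages from the data store, ranked by semantic relevance to the query."""
        ...


class Generator(Protocol):
    def generate(self, query: str, passages: list[str]) -> tuple[str, list[str]]:
        """Produce the response y and the list of cited passages C, grounded in the retrieved passages."""
        ...


def retrieval_augmented_answer(
    x: str,
    datastore: list[str],
    retriever: Retriever,
    generator: Generator,
    top_n: int = 10,
) -> tuple[str, list[str]]:
    # R(x, D): retrieve P = {p_1, ..., p_N} by semantic relevance to x.
    passages = retriever.retrieve(x, datastore, top_n=top_n)
    # G(x, P): generate y together with citations C, each pointing to a passage in P.
    return generator.generate(x, passages)
```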
