
The world's first fully open AI for scientific literature review has just been published in Nature.

The citation accuracy is comparable to that of human experts.

On February 4th, Nature published a research result from a team led by the University of Washington and the Allen Institute for Artificial Intelligence: OpenScholar. It is the world's first fully open-source retrieval-augmented generation (RAG) language model designed specifically for scientific research. It can not only retrieve literature precisely and resist hallucination, but also generate high-quality, citation-backed answers.

The citation accuracy of OpenScholar is comparable to that of human experts. Although it still needs further optimization, this tool is expected to help scientists handle the complex and increasingly onerous task of scientific literature reviews.

Paper link: https://www.nature.com/articles/s41586-025-10072-4

Although large language models (LLMs) perform impressively in many fields, they still face severe challenges in scientific research assistance. As the volume of scientific literature grows rapidly, models struggle to keep up with the latest progress and are prone to serious "hallucination". Experimental data show that when GPT-4o cites scientific literature, 78% to 90% of its citations are incorrect.

By combining 45 million open-access papers with a distinctive self-feedback mechanism, OpenScholar achieves precise literature retrieval and accurate citation generation, effectively addressing the accuracy and credibility problems that existing models face when synthesizing scientific knowledge.

The first fully open-source AI system for scientific literature reviews

OpenScholar is a retrieval-augmented language model designed specifically for scientific research tasks. It answers scientific queries by identifying relevant paragraphs from 45 million open-access papers and synthesizing content with citation support.
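To illustrate the general pattern (this is a hedged sketch, not OpenScholar's released code): embed every paragraph once, embed the incoming query, and rank paragraphs by vector similarity. The encoder name and the three-passage corpus below are placeholders chosen for illustration.

```python
# Minimal dense-retrieval sketch (illustrative, not OpenScholar's code):
# embed paragraphs, embed the query, rank by cosine similarity.
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder encoder

corpus = [
    "Retrieval-augmented generation grounds LLM outputs in retrieved documents.",
    "Transformer attention scales quadratically with sequence length.",
    "Dense retrievers encode queries and passages into a shared vector space.",
]
# Unit-normalized embeddings, so a dot product equals cosine similarity.
corpus_emb = model.encode(corpus, normalize_embeddings=True)   # shape (N, d)

def retrieve(query: str, k: int = 2):
    """Return the top-k passages most similar to the query."""
    q = model.encode([query], normalize_embeddings=True)[0]    # shape (d,)
    scores = corpus_emb @ q                                    # cosine scores
    top = np.argsort(-scores)[:k]                              # best first
    return [(corpus[i], float(scores[i])) for i in top]

print(retrieve("How do dense retrievers work?"))
```

At OpenScholar's scale, the brute-force dot product would be replaced by an approximate-nearest-neighbor index built over the precomputed paragraph embeddings.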

The outstanding performance of OpenScholar stems from its three core technological innovations:

1. Dedicated datastore (OSDS): OpenScholar has its own knowledge base, OSDS, a fully open and up-to-date corpus covering 45 million open-access scientific papers and 236 million paragraph embedding vectors. This large-scale data provides a reproducible basis for training and inference, ensuring comprehensive and timely retrieval.

2. Adaptive retrieval: To locate information precisely in the vast ocean of literature, the system uses a specially trained retriever. It goes beyond simple keyword matching, identifying and extracting the most relevant literature paragraphs based on the semantics of the query (as in the retrieval sketch above) and providing high-quality context for subsequent generation.

3. Self-feedback mechanism: This is OpenScholar's key technological innovation. The model runs a "self-feedback" inference loop: after generating a preliminary answer, it checks its own output, evaluates its factuality, coverage, and citation accuracy, and revises accordingly over several iterations. This self-reflection significantly improves the quality of the final answer; a minimal sketch of the loop follows the figure below.

Figure | The overall architecture of OpenScholar. OpenScholar comprises a dedicated datastore, a retriever, and a language model, and iteratively refines responses through self-feedback inference during retrieval.
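As a concrete illustration of the draft-critique-revise idea, here is a generic sketch in Python. It is an assumption-laden outline, not OpenScholar's implementation: `llm` is a placeholder for any text-generation call, and the prompt wording is invented. The real pipeline also couples feedback with retrieval, which this sketch omits.

```python
def llm(prompt: str) -> str:
    """Placeholder for any text-generation call (API or local model)."""
    return "stub answer [1]"  # replace with a real model call

def answer_with_feedback(query: str, passages: list[str], rounds: int = 2) -> str:
    """Draft an answer from retrieved passages, then critique and revise it."""
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    draft = llm(f"Answer using only the passages, citing them as [n].\n"
                f"Question: {query}\nPassages:\n{context}")
    for _ in range(rounds):
        # The model reviews its own draft for factuality, coverage,
        # and citation accuracy ...
        critique = llm(f"Critique this answer for factuality, coverage, and "
                       f"citation accuracy.\nAnswer: {draft}\nPassages:\n{context}")
        # ... then revises the draft to address the critique.
        draft = llm(f"Revise the answer to address the critique.\n"
                    f"Answer: {draft}\nCritique: {critique}\nPassages:\n{context}")
    return draft

print(answer_with_feedback("What is RAG?", ["RAG grounds generation in retrieval."]))
```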

Performance Evaluation: Comprehensive Superiority over Existing Systems

Previous evaluations of literature synthesis usually focused on short-form outputs, multiple-choice formats, or reasoning tasks in narrow domains. The research team therefore introduced ScholarQABench, the first large-scale, multi-domain benchmark for open-ended scientific literature synthesis, designed to realistically simulate challenges at the research frontier. It contains 2,967 expert-written queries and 208 long-form answers spanning computer science, physics, neuroscience, and biomedicine, and requires generating long-form answers grounded in recent literature drawn from large numbers of papers.
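Citation accuracy is one of the benchmark's central measurements. Purely as an illustration of what such a metric can look like (ScholarQABench's actual scoring relies on model-based and expert judgments, and the `supports()` heuristic below is a deliberately naive placeholder):

```python
# Toy citation-precision sketch: every bracketed citation [n] in an answer
# is checked against the passage it points to.
import re

def supports(claim: str, passage: str) -> bool:
    """Naive word-overlap judge; real evaluations use an entailment
    model or human experts instead."""
    claim_words = set(re.findall(r"[a-z]+", claim.lower()))
    passage_words = set(re.findall(r"[a-z]+", passage.lower()))
    overlap = len(claim_words & passage_words)
    return bool(claim_words) and overlap / len(claim_words) >= 0.5

def citation_precision(answer: str, passages: dict[int, str]) -> float:
    """Fraction of citation markers whose sentence is supported by the
    cited passage."""
    total, supported = 0, 0
    for sent in re.split(r"(?<=[.!?])\s+", answer):
        claim = re.sub(r"\[\d+\]", "", sent).strip()
        for ref in map(int, re.findall(r"\[(\d+)\]", sent)):
            total += 1
            if ref in passages and supports(claim, passages[ref]):
                supported += 1
    return supported / total if total else 0.0

answer = "Dense retrievers use embeddings [1]. Attention is linear [2]."
passages = {1: "Dense retrievers encode text into embeddings.",
            2: "Attention scales quadratically with sequence length."}
print(citation_precision(answer, passages))  # 0.5 with this toy judge
```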

Figure | An overview of ScholarQABench. The test contains 2,200 expert-written interdisciplinary scientific questions, for which the research team developed both automatic and manual evaluation schemes.

In this rigorous new benchmark test, OpenScholar achieved the following key results:

The smaller, lighter OpenScholar-8B model exceeded GPT-4o in overall correctness by 6.1% and the dedicated system PaperQA2 by 5.5%, taking the overall performance lead.

In citation accuracy, OpenScholar not only reached the level of human experts but showed a systematic advantage. Analysis shows that, on the scoring criteria, human-written answers scored 9.6 points higher than non-retrieval GPT-4o, while OpenScholar-8B scored only 2.9 points below human experts.

Figure | Statistics of expert-written answers.

In the human evaluation, experts clearly preferred the answers generated by OpenScholar. Specifically, OpenScholar built on the team's 8-billion-parameter model and on GPT-4o beat human-written answers with win rates of 51% and 70%, respectively, whereas unaugmented GPT-4o won only 31% of the time, below the human-expert baseline.

Figure | Automatic and manual evaluation results. On the computer-science subset of ScholarQABench (Scholar-CS, 100 questions), OpenScholar systems built on the team's 8B model or on GPT-4o significantly outperformed other systems, and in more than 50% of cases in the manual evaluation they were judged better than experts. The manual evaluation was conducted by 16 PhD-level experts on 108 questions in Scholar-Multi.

Beyond its strong performance, OpenScholar was designed with practicality in mind. Its lightweight dedicated retriever sharply reduces operating and compute costs compared with schemes that rely on a large general-purpose model for retrieval, making high-quality, reliable literature-review assistance more sustainable and more widely applicable.

Limitations and Future Outlook

Although OpenScholar has made breakthrough progress, there are still limitations in the current evaluation framework and system.

ScholarQABench focuses mainly on computer science, biomedicine, and physics, and does not yet cover other important disciplines such as the social sciences and engineering, so the findings may not generalize fully to other fields. Because expert annotation is costly and time-consuming, the manually annotated evaluation set is small, which may introduce variance and annotator bias. Moreover, as a static public benchmark, ScholarQABench carries a future risk of data contamination, since its content may be exposed during model training or retrieval.

For some complex queries, OpenScholar still cannot guarantee that it retrieves the most representative or most recent relevant papers. Although the 8-billion-parameter OpenScholar-8B model performs well, its instruction-following and scientific-knowledge capabilities are limited, which can lead to factual errors in its output. The OpenScholar-GPT-4o variant depends on the proprietary GPT-4o API; as the underlying model is updated, the experimental results may be hard to reproduce exactly, posing a challenge to reproducibility. In addition, the current system uses only open-access papers, and how to integrate the large body of copyrighted academic literature reasonably and legally remains an open problem.

The research team has open-sourced OpenScholar's core resources, including the code, data, model checkpoints, datastore, and ScholarQABench, to support and accelerate future research.

Building on this, future work will focus on incorporating user feedback from the platform to keep improving retrieval quality, citation accuracy, and overall usability. The team also plans to push the application boundary further, extending support to more scientific fields and multilingual scenarios, and to seek cooperation with academic publishers to explore compliant data-use mechanisms that balance intellectual-property rights with open access.

This article is from the WeChat public account "Academic Headlines" (ID: SciTouTiao), author: Wang Yueran. It is published by 36Kr with authorization.