Breaking the Bottleneck: Enabling RAG to Think. USTC, BAAI, and Others Release the Reasoning Retrieval Framework BGE-Reasoner
The wave of artificial intelligence is carrying us into a new era defined by RAG and AI Agents. But for these agents to be truly "intelligent" rather than mere information conduits, a core challenge confronting every top-tier team must be overcome: Reasoning-Intensive Information Retrieval (Reasoning-Intensive IR).
It is not only a key bottleneck in the current development of RAG and AI Agent technology, but also decisive for the success of applications such as large-model agents and deep research (DeepResearch).
While researchers worldwide search for breakthroughs, a notable contribution has arrived from China: BGE-Reasoner.
BGE-Reasoner, developed by a joint team from the University of Science and Technology of China, the Beijing Academy of Artificial Intelligence, Beijing University of Posts and Telecommunications, and the Hong Kong Polytechnic University, is an end-to-end solution for reasoning-intensive information retrieval. Through systematic query understanding, vector retrieval, and re-ranking, it significantly improves the performance of retrieval systems on reasoning-intensive tasks.
On the authoritative BRIGHT benchmark, BGE-Reasoner achieved a test score of 45.2, setting a new record on the leaderboard by a wide margin.
As another milestone in the BGE model series, BGE-Reasoner delivers not only a performance breakthrough but also an effective new paradigm for the industry-wide problem of reasoning-intensive retrieval. Technically, its core innovations fall into three areas:
- A replicable framework: a three-stage modular pipeline consisting of a Rewriter, an Embedder, and a Reranker, providing a clear and efficient engineering paradigm for handling complex queries.
- Data-driven innovation: the work demonstrates that large language models can synthesize high-quality, multi-domain reasoning training data, resolving the field's core bottleneck of scarce training data.
- Reinforcement learning: RL is successfully applied to Reranker training, giving the model stronger reasoning and generalization on difficult samples.
The model weights, training code, and training data will soon be released to the community to further advance research and applications in this field.
Project homepage: https://github.com/FlagOpen/FlagEmbedding/tree/master/research/BGE_Reasoner
Introduction
Reasoning-Intensive Information Retrieval (Reasoning-Intensive IR) is a new class of retrieval task that has emerged in recent years. Unlike traditional retrieval, it relies not only on semantic matching but also on deep logical reasoning, multi-step semantic chains, and background knowledge to establish the correct association between a query and its target documents.
To promote research in this area, the University of Hong Kong, Princeton University, and Stanford University jointly proposed BRIGHT, the first authoritative benchmark for reasoning-intensive retrieval. It gathers real queries from sources such as StackExchange, LeetCode, and math competitions, pairing them with relevant documents that can only be identified through multi-step reasoning, to evaluate retrieval systems in complex reasoning scenarios.
On BRIGHT, traditional methods that rely on keyword matching or simple semantic similarity often fail to locate the truly relevant documents, exposing the deficiencies of current retrieval systems in complex reasoning scenarios. Improving performance on reasoning-intensive retrieval has therefore become a key issue for advancing Retrieval-Augmented Generation (RAG) on complex reasoning tasks.
Figure 1. Unlike retrieval tasks based on keywords and direct semantic matching, the BRIGHT benchmark focuses on retrieval in reasoning-intensive scenarios.
Against this backdrop, BGE-Reasoner delivered excellent results on reasoning-intensive retrieval. On the BRIGHT leaderboard it surpassed prior submissions from Ant Group, Baidu, ByteDance, Renmin University of China, the University of Waterloo, and others, beating the second-place entry by 3.6 points. Its built-in vector model, BGE-Reasoner-Embed, also clearly outperformed top baseline models such as Seed1.5-Embedding, Qwen3-Embedding, and GTE.
Figure 2. On the BRIGHT leaderboard, BGE-Reasoner achieved SOTA performance, ranking first as of August 21. BGE-Reasoner-Embed also performed strongly with original queries, achieving SOTA among vector models. Leaderboard link: https://brightbenchmark.github.io
Figure 3. Retrieval performance of BGE-Reasoner and BGE-Reasoner-Embed versus baseline models on BRIGHT.
Technical Analysis
BGE-Reasoner adopts the classic three-module design of information retrieval:
- Query understanding (BGE-Reasoner-Rewriter): understands and rewrites the initial query into an optimized query better suited to retrieval;
- Vector model (BGE-Reasoner-Embed): retrieves with the rewritten query, in collaboration with BM25, to produce a set of candidate documents;
- Ranking model (BGE-Reasoner-Reranker): re-ranks the candidates to yield a more accurate final ordering.
In the actual workflow, the user's original query is first rewritten by BGE-Reasoner-Rewriter; BGE-Reasoner-Embed and BM25 then retrieve in parallel to obtain candidate documents; finally, BGE-Reasoner-Reranker performs fine-grained re-ranking. The system fuses the multi-path results into a final ranking, completing the end-to-end reasoning-based retrieval process. The full framework is shown in the following figure:
Figure 4. Schematic of BGE-Reasoner's end-to-end retrieval process.
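The article does not specify how the BM25 and dense result lists are fused, so the sketch below uses reciprocal rank fusion (RRF), a common technique for merging multiple rankings; the function name and toy document ids are illustrative, not from the BGE-Reasoner release.

```python
def rrf_fuse(ranked_lists, k=60):
    """Merge several ranked lists of document ids into one ranking.

    Reciprocal Rank Fusion: each list contributes 1 / (k + rank + 1)
    to a document's score; documents with higher totals rank first.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Toy example: top-3 ids from BM25 and from the dense retriever.
bm25_hits = ["d1", "d2", "d3"]
dense_hits = ["d2", "d3", "d1"]
candidates = rrf_fuse([bm25_hits, dense_hits])
print(candidates)  # d2 ranks near the top of both lists, so it leads
```

The constant `k` damps the influence of any single list's top hit; 60 is the value commonly used in the RRF literature.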
Data synthesis. Unlike open-ended question-answering scenarios, reasoning-intensive retrieval has very little training data. To address this, the team from the Beijing Academy of Artificial Intelligence and its partner institutions turned to LLM-based data synthesis: starting from knowledge-intensive corpora drawn from real-world scenarios, they synthesize high-quality reasoning-intensive queries for specific scenarios, then use the strong comprehension ability of large language models to construct high-quality positive and negative examples for each query. The result is a multi-domain set of reasoning-intensive retrieval training data, covering areas such as mathematics and code, that supports the training of the subsequent modules.
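As an illustration of the query-synthesis step, the snippet below builds the kind of prompt a teacher LLM might receive; the wording, constraints, and function name are hypothetical stand-ins, not taken from the paper.

```python
def build_query_prompt(passage: str, domain: str) -> str:
    """Build a prompt asking a teacher LLM to write a reasoning-intensive query.

    The constraints mirror the article's goal: the answer must require
    multi-step reasoning, and the query should share little surface overlap
    with the passage so that keyword matching alone cannot solve it.
    """
    return (
        f"You are building retrieval training data for the {domain} domain.\n"
        f"Passage:\n{passage}\n\n"
        "Write one question that this passage answers, such that:\n"
        "1. Answering it requires multi-step reasoning over the passage.\n"
        "2. The question shares few keywords with the passage.\n"
        "Return only the question."
    )

prompt = build_query_prompt("A sorted array supports O(log n) search.", "code")
```

In a full pipeline, the teacher model's answer to this prompt becomes a synthetic query, and the source passage becomes its positive document; hard negatives would then be mined from the same corpus.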
Query understanding. For this module, the researchers use a teacher model with strong reasoning ability to generate multiple reasoning paths over the synthetic data, then filter for high-quality outputs via rejection sampling to build training samples. These samples are used to fine-tune Qwen2.5-7B-Instruct, markedly improving its query understanding and rewriting ability and yielding BGE-Reasoner-Rewriter.
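Rejection sampling here amounts to: sample several teacher outputs per query, score each candidate, and keep only those that clear a quality bar. The scoring rule below (reciprocal rank of the gold document when retrieving with the rewrite) is an illustrative stand-in; the paper's exact acceptance criterion is not given in the article.

```python
def rejection_sample(candidates, score_fn, threshold):
    """Keep only the candidate rewrites whose quality score clears the bar."""
    return [c for c in candidates if score_fn(c) >= threshold]

# Toy scorer: pretend each rewrite's score is the reciprocal rank of the
# gold document when retrieving with that rewrite (1.0 = ranked first).
gold_rank = {"rewrite-a": 1, "rewrite-b": 7, "rewrite-c": 2}
score = lambda c: 1.0 / gold_rank[c]

kept = rejection_sample(["rewrite-a", "rewrite-b", "rewrite-c"], score, 0.5)
print(kept)  # only rewrites whose gold document ranked in the top 2 survive
```

The surviving rewrites then serve as supervised fine-tuning targets for the rewriter model.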
Vector model. The built-in vector model BGE-Reasoner-Embed is fine-tuned from the Qwen3-8B base model. Backed by the high-quality synthetic training data, its ability on reasoning-intensive retrieval improves substantially: on BRIGHT, whether given original queries or GPT-4 reasoning queries, BGE-Reasoner-Embed achieves the best retrieval performance among current vector models, confirming the effectiveness of the synthetic data.
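At retrieval time, an embedder reduces to: encode the query and documents as vectors, then rank documents by similarity. The sketch below uses toy 3-dimensional vectors in place of real BGE-Reasoner-Embed outputs, which would come from the released model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def dense_retrieve(query_vec, doc_vecs, top_k=2):
    """Rank document ids by cosine similarity to the query vector."""
    ranked = sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]),
                    reverse=True)
    return ranked[:top_k]

# Toy vectors standing in for real embedding-model outputs.
docs = {"d1": [1.0, 0.0, 0.0], "d2": [0.9, 0.1, 0.0], "d3": [0.0, 1.0, 0.0]}
print(dense_retrieve([1.0, 0.05, 0.0], docs))  # ["d1", "d2"]
```

In production, the document vectors would be precomputed and served from a vector index rather than scored exhaustively as here.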
Ranking model. The built-in ranking model BGE-Reasoner-Reranker is fine-tuned from the Qwen3 series of base models. Given the task's definition of relevance, the model reasons in fine detail over a query and each candidate document, identifies the key information fragments, and scores relevance accurately. Reinforcement learning is introduced during training to strengthen reasoning on difficult samples; at inference time, test-time augmentation produces a more robust relevance score, further improving ranking performance.
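The test-time augmentation step can be read as: score each (query, document) pair several times with independently sampled reasoning runs, then aggregate the scores. The mean aggregation below is an assumed choice for illustration; the article does not state the exact rule.

```python
from statistics import mean

def tta_score(samples):
    """Aggregate relevance scores from several sampled reasoning runs."""
    return mean(samples)

def rerank(sampled_scores):
    """Order candidate doc ids by their aggregated relevance score."""
    return sorted(sampled_scores,
                  key=lambda d: tta_score(sampled_scores[d]),
                  reverse=True)

# Three sampled reasoning runs per candidate; averaging smooths out
# the variance of any single run's reasoning path.
sampled = {"d1": [0.2, 0.4, 0.3], "d2": [0.9, 0.7, 0.8], "d3": [0.6, 0.5, 0.4]}
print(rerank(sampled))  # ["d2", "d3", "d1"]
```

Because each reasoning run can take a different path, averaging several runs trades extra inference cost for a more stable ranking signal.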
Figure 5. Schematic of BGE-Reasoner-Reranker's reasoning process.
Conclusion
The strong performance of BGE-Reasoner confirms the important roles of reinforcement learning and synthetic data in reasoning-intensive information retrieval, providing key support for the future development of Agent Search.
The Beijing Academy of Artificial Intelligence will continue to focus on vector models and retrieval-augmented technologies, steadily improving the capability and generality of the BGE model series. We look forward to collaborating with more research institutions and industry partners to advance retrieval and artificial intelligence together, and we welcome researchers and developers to follow and use the BGE models, jointly building an open and thriving open-source ecosystem.
This article is from the WeChat official account "Machine Intelligence" and is published by 36Kr with authorization.