
Did RAG retrieve correctly but still answer wrong? Saarland University in Germany has found out why.

QbitAI | 2026-04-17 15:24
Sort out long documents and contradictory information effortlessly.

RAG (Retrieval-Augmented Generation) has become the standard technique for putting large models into production. But anyone who has used it knows the pain point: even when the correct documents are retrieved, the model's answers can still be far off the mark.

What exactly is going on?

A research team from Saarland University in Germany, Tencent Youtu, Shanghai Jiao Tong University, Fudan University, and Zhejiang University has delivered a precise diagnosis: the problem lies not in retrieval but in reading comprehension. Existing RAG systems feed the retrieved passages to the model as "loose parts," flattening both the primary-secondary relationships within each passage and the logical connections between passages. What the model sees is not organized reference material but a jumble of information.

To address this, the research team proposed Disco-RAG — a new RAG framework that inserts a "comprehension" step between "searching" and "answering." The work has been accepted as a long paper at the ACL 2026 main conference. It achieves multiple best results on three authoritative benchmarks and requires no training at all.

First, let's look at an example: How does traditional RAG "give wrong answers"?

The user asks, "Can vitamin D supplementation prevent the flu?" The system retrieves two pieces of literature:

Paragraph A: "In the adult population with low vitamin D levels in winter, the incidence of the flu decreased by 12% after additional vitamin D supplementation."

Paragraph B: "Large-scale randomized controlled trials have not found a statistically significant association between vitamin D supplementation and the risk of the flu."

Traditional RAG simply concatenates A and B and hands them to the model. Seeing "a 12% decrease," the model outputs "vitamin D is effective" — completely ignoring the crucial qualifying condition at the start of A ("winter + low-level population") and failing to notice that A and B actually contradict each other.
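This flat concatenation can be sketched in a few lines (a toy illustration; the function name and prompt template are my own, not taken from any specific framework):

```python
def build_naive_rag_prompt(question: str, passages: list[str]) -> str:
    """Traditional RAG: retrieved passages are pasted in as flat, unmarked text.

    Everything structural is lost here: which span is a precondition, which
    is a conclusion, and whether the passages support or contradict each other.
    """
    context = "\n\n".join(passages)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

passage_a = ("In the adult population with low vitamin D levels in winter, "
             "the incidence of the flu decreased by 12% after supplementation.")
passage_b = ("Large-scale randomized controlled trials have not found a "
             "statistically significant association between vitamin D "
             "supplementation and the risk of the flu.")

prompt = build_naive_rag_prompt(
    "Can vitamin D supplementation prevent the flu?", [passage_a, passage_b])
```

Nothing in the resulting string tells the model that the 12% figure is conditional, or that the two passages point in opposite directions.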

Behind this are two fatal blind spots of traditional RAG:

It cannot distinguish primary from secondary within a paragraph — the model simply cannot tell which sentence is the conclusion and which is a precondition.

It cannot see the relationship between paragraphs — the model has no idea whether the two pieces of literature support or contradict each other.

In other words: The shortcoming of RAG is not "inability to search," but "inability to understand what is retrieved."

It's not that existing methods don't work

The industry has actually been aware of this problem for a long time. In the past few years, researchers have proposed various remedial ideas: reordering the retrieval results to place the most relevant paragraphs at the front; rewriting the user's query to make the search more accurate; compressing redundant paragraphs to reduce the interference of irrelevant information; and even having the model perform multiple rounds of iterative retrieval to gradually approach the answer.

These methods do help, but they all optimize the "search" step — resting on an implicit assumption: as long as better content is presented to the model, the model will naturally give good answers.

In reality, however, the content is often already "good enough"; the problem is that the model cannot organize it after receiving it. When complex logical relationships hold between passages — for example, one states a conclusion under certain conditions while another reports a large-scale experiment with the opposite result — simply reordering or compressing them does not help the model understand how the passages relate.

This is exactly the core problem that Disco-RAG aims to solve: It is not about making the model see better content, but about making the model truly understand the existing content.

How does Disco-RAG solve it? Three steps to teach the model to "read" documents

The idea is straightforward: insert a "comprehension" step between "searching" and "answering," use Rhetorical Structure Theory (RST) — a classic framework from linguistics — to analyze the logic of the text, and only then let the model write.

It takes three steps throughout, without changing a single parameter of the model:

Step 1: Draw an "argument tree" for each paragraph. Use the LLM to break the paragraph down into the smallest semantic units (EDUs), mark each unit as "core content" or "auxiliary explanation," and simultaneously identify the relation type between units (such as cause-effect, contrast, or elaboration). In this way, the model can tell whether the focus of the paragraph is "a 12% decrease" or "only for a specific population."
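Step 1 can be pictured with a tiny data structure (a sketch only: in Disco-RAG the EDU segmentation and labeling are done by an LLM, while here they are hard-coded for the vitamin D example; names like `EDU` and `core_claim` are my assumptions, not the paper's API):

```python
from dataclasses import dataclass

@dataclass
class EDU:
    """Elementary Discourse Unit: the smallest semantic span of a passage."""
    text: str
    role: str           # "nucleus" (core content) or "satellite" (auxiliary)
    relation: str = ""  # RST relation to the nucleus, e.g. "condition", "contrast"

# Hand-built argument tree for paragraph A of the vitamin D example.
paragraph_a = [
    EDU("in the adult population with low vitamin D levels in winter",
        role="satellite", relation="condition"),
    EDU("the incidence of the flu decreased by 12% after supplementation",
        role="nucleus"),
]

def core_claim(edus):
    """Nucleus EDUs carry the paragraph's main claim; satellites qualify it."""
    return [e.text for e in edus if e.role == "nucleus"]

def qualifiers(edus):
    return [(e.relation, e.text) for e in edus if e.role == "satellite"]
```

With this labeling, a downstream prompt can state explicitly that the 12% figure holds only under the "condition" satellite, instead of presenting both spans as equals.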

Step 2: Weave a "relationship network" for all paragraphs. Conduct pairwise analysis on all the retrieved paragraphs to predict whether they support, refute, supplement, or have no relation to each other, and finally form a directed graph. In the above example, the system will mark a "contrast" relationship between A and B.
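Step 2's pairwise analysis amounts to building a directed relation graph over passage pairs. A minimal sketch, with the LLM classifier replaced by an injected stub (the relation labels and function names are assumptions, not the paper's exact taxonomy):

```python
from itertools import combinations

# Candidate inter-passage relations from the pairwise analysis.
RELATIONS = {"support", "contrast", "elaboration", "none"}

def build_relation_graph(passages, classify):
    """Pairwise-classify passages into a directed relation graph.

    In Disco-RAG `classify(a, b)` would be an LLM call; it is injected here
    so the graph-building logic can be run with a stub.
    """
    graph = {}
    for i, j in combinations(range(len(passages)), 2):
        rel = classify(passages[i], passages[j])
        assert rel in RELATIONS
        if rel != "none":          # keep only informative edges
            graph[(i, j)] = rel
    return graph

# Stub classifier for the vitamin D example: A and B conflict.
passages = ["A: flu incidence fell 12% in a low-vitamin-D winter cohort.",
            "B: large RCTs found no significant association."]
graph = build_relation_graph(passages, classify=lambda a, b: "contrast")
```

The resulting edge `(0, 1) → "contrast"` is exactly the signal traditional RAG never surfaces.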

Step 3: Make an outline first, then write the answer. Combining the user's question, the original paragraphs, the argument tree, and the relationship network, Disco-RAG automatically generates a "writing outline." The outline indicates the key evidence to be cited, the order of narration, and how to coordinate conflicting information. Finally, the model uses the outline as a guide to produce the final answer.
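The three steps can be put together as a single flow (a structural sketch only; every prompt and function name here is an assumption, and `llm` stands in for any chat-model call):

```python
def disco_rag_answer(question, passages, llm):
    """Sketch of the Disco-RAG flow: analyze first, plan next, write last.

    `llm(prompt)` stands in for a chat-model call; no model parameters are
    changed anywhere -- all three stages are pure prompting.
    """
    # Step 1: per-passage argument tree (EDUs + nucleus/satellite roles).
    trees = [llm(f"Segment into EDUs and label nucleus/satellite:\n{p}")
             for p in passages]
    # Step 2: pairwise inter-passage relations (support/contrast/...).
    relations = llm("Classify the pairwise relations between passages:\n"
                    + "\n".join(passages))
    # Step 3: make an outline first, then write the answer guided by it.
    outline = llm(f"Question: {question}\nTrees: {trees}\n"
                  f"Relations: {relations}\nWrite an answer outline.")
    return llm(f"Outline: {outline}\nAnswer the question: {question}")

# With a counting stub "model", two passages yield exactly five LLM calls:
# two argument trees, one relation pass, one outline, one final answer.
calls = []
def echo_llm(prompt):
    calls.append(prompt)
    return f"<response {len(calls)}>"

answer = disco_rag_answer("Can vitamin D prevent the flu?",
                          ["passage A", "passage B"], echo_llm)
```

Because the stages only exchange text, they can be swapped, cached, or routed to different models independently.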

Back to the vitamin D example

What exactly happens when Disco-RAG handles the question "Can vitamin D prevent the flu?"

First, the argument tree will analyze the internal structure of paragraph A, marking "in the adult population with low vitamin D levels in winter" as a qualifying condition (auxiliary unit) and "the incidence of the flu decreased by 12%" as a core conclusion (core unit). This means that the model will no longer regard a local conclusion with a precondition as a universal fact.

Next, the relationship network will establish an edge of a "contrast" relationship between paragraph A and paragraph B — clearly telling the model that the stances of these two pieces of literature conflict and that the answer cannot be simply taken from one of them.

Finally, the writing outline will plan the answering strategy accordingly: first introduce the findings and applicable scopes of the two studies respectively, then point out the contradiction between them, and finally give a conditional comprehensive judgment.

In this way, the model's final answer will no longer be a simple and crude "effective" or "ineffective," but a well-structured, conditional, and evidence-based analysis. This is exactly what users expect from high-quality answers.

Report card: Leading across three benchmarks

The team conducted a comprehensive evaluation on three authoritative benchmarks covering different scenarios, using multiple open-source models without any training.

Long document reasoning (Loong)

This benchmark tests the model's reasoning ability on extremely long documents, with document lengths ranging from 10,000 to 250,000 tokens. The core finding: The longer the document, the greater the advantage of Disco-RAG. At the longest 250,000-token level, ordinary RAG almost completely fails, while Disco-RAG can still give effective answers. More notably, the overall performance of Disco-RAG even exceeds that of methods that require specialized training.

Ambiguous question answering (ASQA)

Facing ambiguous questions, Disco-RAG sets new best records on the core metrics. More notably, even with a model with very few parameters, Disco-RAG reaches the level of various specially designed systems.

Science summarization (SciNews)

Transforming academic papers into popular news summaries — this task tests comprehensive understanding and expression abilities. Disco-RAG won first place in three out of four evaluation indicators and ranked second in factual consistency.

Does the improvement really come from "understanding the structure"?

The team conducted a series of control experiments to verify this:

Each of the three modules is indispensable. Removing the argument tree, the relationship network, or the outline step causes a significant performance drop, indicating that each plays a distinct role.

Adding only a plan is not enough; the structure is what matters. Adding a generic planning step (without discourse structure) to ordinary RAG yields limited improvement. The significant gains of Disco-RAG come mainly from the structured representation of "argument tree + relationship network." This shows the model is actually using the logical structure of the text, not merely benefiting from longer input.

It remains robust in the face of noise and changes in granularity. Even when a large number of retrieval results are replaced with irrelevant content or the paragraph segmentation granularity is significantly adjusted, ordinary RAG fluctuates greatly, while Disco-RAG always maintains stable performance.

Actual deployment: Use small models for analysis and large models for generation

The three modules of Disco-RAG (argument tree, relationship network, and outline) are decoupled from the final answer generation and can be handled by models of different sizes. The team ran a set of hybrid deployment experiments: using Llama-3.1-8B, with relatively few parameters, to handle all the structural-analysis modules, and invoking Llama-3.3-70B only in the final generation step.

The results show that using a small model for structural analysis and a large model only for final generation can restore most of the performance gain. Even running Disco-RAG entirely with an 8B small model yields far better results than running ordinary RAG with a 70B model. This means that the implementation cost of Disco-RAG can be very flexible. The structural analysis module can be "downscaled" for deployment while still retaining the core benefits.
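The hybrid deployment boils down to a routing table from pipeline stage to model (the model names follow the experiment described in the text; the stage names and routing code are illustrative assumptions):

```python
# Route each Disco-RAG stage to a model size: cheap structural analysis,
# strong final generation. Stage names are illustrative, not the paper's API.
MODEL_ROUTING = {
    "argument_tree":  "Llama-3.1-8B",   # structural analysis tolerates small models
    "relation_graph": "Llama-3.1-8B",
    "outline":        "Llama-3.1-8B",
    "final_answer":   "Llama-3.3-70B",  # only the last step needs the large model
}

def pick_model(stage: str) -> str:
    """Return the model assigned to a pipeline stage."""
    return MODEL_ROUTING[stage]
```

Since the stages communicate only through text, this routing can be changed per deployment without touching the pipeline itself.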

Combined with training: Discourse structure and fine-tuning are not in conflict

Since Disco-RAG already improves performance without any training, what happens if training is added? The team ran a comparison on the SciNews summarization task:

Two key findings: First, Disco-RAG without training already outperforms ordinary RAG after fine-tuning, which shows that the value of structural information cannot be underestimated. Second, when fine-tuning is combined with discourse structure, the effect is further improved, indicating that the benefits brought by the two are complementary rather than overlapping. This points out a clear path for practical applications: First, use Disco-RAG without training to obtain immediate benefits, and then add fine-tuning as needed to further improve the performance.

Summary

Disco-RAG provides a clear idea: Instead of constantly optimizing the "search," it is better to first teach the model to "read."

Add a layer of discourse structure analysis between retrieval and generation. Let the model understand the primary-secondary relationships within paragraphs, clarify the logical connections between paragraphs, and then make an outline before writing. Without training and without modifying the model, the answer quality of RAG can be significantly improved. This is especially obvious in long document and noisy scenarios.

From a broader perspective, this work reveals a long-neglected fact: Natural language text is not a simple stack of sentences; it has its own logical framework — with primary and secondary elements, cause and effect, twists and turns, and echoes. When we restore this framework to the model, the model's ability to understand and organize information will undergo a qualitative change.

This idea is not only applicable to the RAG scenario but may also provide new inspiration for broader tasks such as multi-document reasoning and long-text understanding.