
HLE "Human Last Exam" breaks through 60 points for the first time. Eigen-1 based on DeepSeek V3.1 significantly outperforms Grok4 and GPT-5.

QbitAI (量子位), 2025-09-28 20:03
Three major technological innovations underpin the 60-point breakthrough.

For the first time, a system has exceeded the 60-point mark on the expert-verified subset of HLE (Humanity's Last Exam)!

Recently, the Eigen-1 multi-agent system, jointly developed by teams including Xiangru Tang and Yujie Wang from Yale University, Wanghan Xu from Shanghai Jiao Tong University, Guancheng Wan from UCLA, Zhenfei Yin from the University of Oxford, and Di Jin and Hanrui Wang from Eigen AI, achieved a historic breakthrough.

On the HLE Bio/Chem Gold test set, the Pass@1 accuracy reached 48.3%, and the Pass@5 accuracy soared to 61.74%, crossing the 60-point mark for the first time. This result far surpasses Google Gemini 2.5 Pro (26.9%), OpenAI GPT-5 (22.82%), and Grok 4 (30.2%).

Most excitingly, this achievement is not based on a closed-source super-large model but is built entirely on the open-source DeepSeek V3.1.

On this open-source foundation, the research team achieved a qualitative leap by integrating three innovative mechanisms: Monitor-based RAG (Implicit Knowledge Enhancement), HSR (Hierarchical Solution Refinement), and QAIR (Quality-Aware Iterative Reasoning).

Let's delve into the details below:

Technological Innovation: Three Pillars Support the 60-Point Breakthrough

As AI begins to challenge the ultimate boundaries of human knowledge, an unprecedented competition is unfolding.

With large models already "scoring 90-plus" on traditional benchmarks such as MMLU and GPQA, these tests are gradually losing their discriminative power. To track AI's real progress at the frontier of scientific reasoning, the Center for AI Safety and Scale AI jointly launched "Humanity's Last Exam" (HLE).

It covers more than 100 fields, including mathematics, the natural sciences, engineering, the humanities, and the social sciences, with a total of 3,000 doctoral-level questions, and is regarded as the ultimate test of AI knowledge reasoning.

The HLE Bio/Chem Gold is the gold-standard subset of HLE, containing 149 questions manually reviewed and corrected by domain experts.

Compared with the original HLE dataset, this subset excludes questions with possible ambiguities or incorrect answers, ensuring the accuracy and reliability of the labels. Therefore, it has become the most reliable benchmark for evaluating AI's scientific reasoning ability.

It was on the HLE Bio/Chem Gold subset that the Eigen-1 system crossed the 60-point mark for the first time, thanks to its three innovative mechanisms.

1. Monitor-based RAG: Implicit Retrieval Enhancement to Eliminate the "Tool Tax"

Traditional Retrieval-Augmented Generation (RAG) systems are like video players that frequently pause. Every time they need external knowledge, they must interrupt the reasoning process, construct queries, process results, and then reintegrate the context.

The research team vividly calls this overhead the "Tool Tax". Every tool call interrupts the thinking process, leading to context loss.

The "Tool Tax" problem of traditional RAG systems is vividly demonstrated in the following population genetics case. The left side shows the model over - confidently using the wrong formula, while the right side shows that even after obtaining the correct formula through explicit RAG, the interruption of the reasoning process prevents the model from reintegrating the knowledge into the original problem.

Eigen-1's Monitor-based RAG completely changes this paradigm:

Implicit Monitoring: The Monitor continuously tracks uncertainty in the reasoning flow. Like an attentive assistant working quietly in the background, it scans the reasoning trajectory and triggers retrieval only when uncertainty appears.

Precise Querying: Once uncertainty is flagged, the Querier extracts a minimal set of keywords, avoiding unnecessary expansion of the search space.

Seamless Injection: The Injector seamlessly integrates the retrieved knowledge into the reasoning flow, just like naturally supplementing background information in a conversation rather than inserting citations abruptly.
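To make the division of labor concrete, here is a minimal sketch of how such an implicit retrieval loop could be wired together. It is an illustration under assumptions, not Eigen-1's actual implementation: the class and function names (ReasoningState, monitor, querier, injector), the hedge-word uncertainty heuristic, and the llm_step and search_corpus callables are all placeholders.

```python
# Minimal sketch of a Monitor-based (implicit) RAG loop.
# All names here are illustrative placeholders, not the paper's API:
# llm_step stands in for one incremental model call, search_corpus for a
# literature/knowledge-base search.
from dataclasses import dataclass, field


@dataclass
class ReasoningState:
    steps: list[str] = field(default_factory=list)            # reasoning trace so far
    injected_facts: list[str] = field(default_factory=list)   # background knowledge woven in


def uncertainty_score(step: str) -> float:
    """Toy heuristic: count hedging phrases as a proxy for model uncertainty
    (a real system might use token log-probs or a trained detector)."""
    hedges = ("not sure", "approximately", "if i recall", "assuming")
    return sum(h in step.lower() for h in hedges) / len(hedges)


def monitor(step: str, threshold: float = 0.2) -> bool:
    """Monitor: silently watch the reasoning flow and flag uncertain steps."""
    return uncertainty_score(step) > threshold


def querier(step: str) -> str:
    """Querier: distill the uncertain step into a minimal keyword query."""
    keywords = [w.strip(".,") for w in step.split() if len(w) > 6]
    return " ".join(keywords[:4])


def injector(state: ReasoningState, facts: list[str]) -> None:
    """Injector: add retrieved facts as background context for later steps,
    without restarting or re-prompting the whole reasoning chain."""
    state.injected_facts.extend(facts)


def solve(problem: str, llm_step, search_corpus, max_steps: int = 12) -> ReasoningState:
    """Run the reasoning loop; retrieval is triggered implicitly by the Monitor,
    never by an explicit tool call from the model itself."""
    state = ReasoningState()
    for _ in range(max_steps):
        step = llm_step(problem, state)
        state.steps.append(step)
        if monitor(step):
            facts = search_corpus(querier(step))
            injector(state, facts)
        if "FINAL ANSWER" in step:
            break
    return state
```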

Experimental data shows that, compared with explicit RAG, Monitor-based RAG reduces token consumption by 53.5% and the number of workflow iterations by 43.7%, while maintaining higher accuracy.

In the following haplotype counting case, the Monitor detects the uncertainty of recombination constraints, the Querier generates targeted queries, and the Injector injects two key facts, enabling the model to rule out invalid cases and arrive at the correct answer of 30 haplotypes.

2. Hierarchical Solution Refinement (HSR): From "Democratic Voting" to "Hierarchical Refinement"

In addition to implicit knowledge enhancement, Eigen-1 also overhauls the way its multiple agents collaborate.

Traditional multi-agent systems use a "democratic voting" mechanism, where all candidate solutions are treated equally, which easily "dilutes" the optimal solution.

The Hierarchical Solution Refinement (HSR) introduced by Eigen-1 breaks this assumption. HSR uses an "anchor-repair" structure: one candidate serves as the anchor, and the others are used as references for sequential correction, forming hierarchical collaboration.

Under the HSR framework, each candidate solution takes turns to be the "anchor", and the other solutions serve as "references" to provide targeted corrections. This design allows strong solutions to absorb valuable insights from weak solutions rather than simply averaging them.

It specifically includes four repair dimensions: Logical Completion (filling in missing reasoning steps), Numerical Correction (correcting calculation errors), Method Replacement (replacing weaker methods with better strategies), and Expression Optimization (improving clarity without changing the essence).

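As a rough illustration of this anchor-reference pattern, the sketch below lets each candidate take a turn as the anchor, repairs it sequentially against every other candidate, and keeps the best refined result. The Candidate class, the repair prompt, and the llm/judge callables are assumptions for the sketch; Eigen-1's actual prompts and scoring may differ.

```python
# Sketch of hierarchical (anchor-reference) solution refinement.
# `llm` and `judge` stand in for a generation model and a quality scorer.
from dataclasses import dataclass


@dataclass
class Candidate:
    solution: str
    score: float = 0.0  # quality estimate assigned by the judge


def repair(anchor: Candidate, reference: Candidate, llm) -> Candidate:
    """Revise the anchor using the reference as evidence (the four repair
    dimensions), instead of averaging or voting over the two."""
    prompt = (
        "Anchor solution:\n" + anchor.solution +
        "\n\nReference solution:\n" + reference.solution +
        "\n\nRevise the anchor: fill in missing reasoning steps, correct "
        "calculation errors, replace weaker methods with better strategies, "
        "and improve clarity without changing what is essentially correct."
    )
    return Candidate(solution=llm(prompt))


def hierarchical_refinement(candidates: list[Candidate], llm, judge) -> Candidate:
    refined = []
    for i, anchor in enumerate(candidates):            # each candidate takes a turn as anchor
        current = anchor
        for j, reference in enumerate(candidates):
            if i != j:                                 # sequential correction from every other candidate
                current = repair(current, reference, llm)
        current.score = judge(current.solution)
        refined.append(current)
    return max(refined, key=lambda c: c.score)         # keep the best refined anchor
```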

The following figure vividly demonstrates the working principle of HSR through an image recognition task.

Facing a composite task of insect recognition and flower counting, the anchor solution initially chose ResNet (Option C), but there was an error in calculating the deployment time. By introducing other solutions as references, the system made four types of targeted corrections.

3. Quality-Aware Iterative Reasoning (QAIR): Quality-Driven Iterative Optimization

Quality-Aware Iterative Reasoning (QAIR) adaptively adjusts iteration depth according to solution quality: high-quality solutions can converge early, while low-quality solutions trigger more exploration, striking a balance between efficiency and accuracy.

This mechanism evaluates each solution along three dimensions: logic, answer correctness, and explanation completeness. Only solutions that fall short of the bar enter the next round of correction, avoiding wasted compute on low-quality candidates.
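A compact sketch of such a quality-gated loop follows, assuming a judge that scores each solution on the three axes above and a revise call that patches the flagged ones; the 0.8 threshold and the score format are illustrative, not the paper's exact criteria.

```python
# Sketch of a quality-aware iterative reasoning (QAIR-style) loop.
# `judge` returns per-axis scores, e.g. {"logic": 0.9, "answer": 0.7, "explanation": 0.8};
# `revise` rewrites a solution given its scores. Both are placeholder callables.

def quality_aware_iteration(solutions: list[str], judge, revise,
                            max_rounds: int = 3, threshold: float = 0.8) -> list[str]:
    for _ in range(max_rounds):
        # Only solutions below the bar on any axis are sent for another round.
        pending = [i for i, sol in enumerate(solutions)
                   if min(judge(sol).values()) < threshold]
        if not pending:
            break                                   # all solutions pass: converge early
        for i in pending:
            solutions[i] = revise(solutions[i], judge(solutions[i]))
    return solutions
```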

Comprehensive Dominance: Beyond HLE

The advantages of Eigen-1 are not limited to HLE:

1. HLE Bio/Chem Gold (149 questions)

Pass@1: 48.30% (13.4 percentage points ahead of SciMaster)

Pass@5: 61.74% (breaking the 60% mark for the first time)

2. SuperGPQA Biology (Hard version)

Pass@1: 69.57%

Pass@5: 78.26%

3. TRQA Literature Comprehension

Pass@1: 54.65%

Pass@5: 79.07%

Deep Insights: The Patterns Behind the Success

Analysis of Error Patterns

The pie chart in Figure 7 reveals a key insight: 92.78% of the errors involve problems in the reasoning process, and 88.66% involve problems in knowledge application, with a large overlap between the two.

This indicates that the core challenge of scientific reasoning lies not in simple knowledge retrieval or logical reasoning but in how to seamlessly integrate knowledge and reasoning.

In contrast, execution-following errors (13.40%) and understanding errors (9.28%) account for a relatively small proportion, indicating that the model is relatively mature in instruction understanding and execution.

Precise Quantification of Component Contributions

The team precisely quantified the contribution of each component through incremental construction and ablation experiments.

The baseline system could only achieve an accuracy of 25.3% without any external knowledge, consuming 483.6K tokens. After adding explicit RAG, the accuracy increased to 41.4%, but at the cost of a surge in workflow steps from 43.4 to 94.8, which is a direct manifestation of the "Tool Tax".

When the Monitor component was introduced, accuracy dipped to 34.5%, but token consumption dropped sharply to 218.4K and workflow steps fell to 51.3.

With the addition of the Querier and Injector, the accuracy recovered to 40.3%. The introduction of HSR increased the accuracy to 43.7%, and finally, QAIR pushed the accuracy of the complete system to 48.3%, while maintaining efficient resource utilization (218.9K tokens, 53.4 steps).

The ablation experiments verified the necessity of each component from another perspective. Removing the Monitor caused the token consumption to soar to 461.3K and the workflow steps to increase to 95.3, demonstrating the great value of implicit enhancement.

Removing HSR or QAIR reduced accuracy to 44.8% and 43.7%, respectively, confirming the importance of hierarchical refinement and quality-aware iteration.

The Subtle Balance between Diversity and Consensus

Through scatter plots and regression analysis, the authors revealed a counter-intuitive but instructive finding.

In information-retrieval tasks (339 samples), agreement between solutions has only a weak positive correlation with accuracy (slope 0.369), indicating that different retrieval paths and perspectives bring complementary information; diversity is beneficial.

In reasoning tasks (392 samples), the situation is exactly the opposite: agreement has a strong positive correlation with accuracy (slope 0.851), indicating that when multiple reasoning paths reach the same conclusion, that conclusion is very likely correct.

Therefore, retrieval-type tasks should encourage solution diversity and parallel routes, while pure reasoning tasks should tend towards early consensus and convergence.
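As a hypothetical illustration of this guideline, an aggregator's stopping rule could be switched on task type: demand early consensus for reasoning, and keep sampling diverse routes for retrieval. The agreement measure, the 0.8 threshold, and the sampling budget below are assumptions, not values from the paper.

```python
# Illustrative task-adaptive stopping rule (not from the paper).
from collections import Counter


def agreement(final_answers: list[str]) -> float:
    """Fraction of candidate solutions that share the most common final answer."""
    if not final_answers:
        return 0.0
    return Counter(final_answers).most_common(1)[0][1] / len(final_answers)


def should_stop(final_answers: list[str], task_type: str,
                consensus_threshold: float = 0.8, sample_budget: int = 5) -> bool:
    if task_type == "reasoning":
        # Agreement is strong evidence of correctness: converge once most paths agree.
        return agreement(final_answers) >= consensus_threshold
    # Retrieval-style tasks: agreement is weak evidence, so keep exploring
    # diverse routes until the sampling budget is exhausted.
    return len(final_answers) >= sample_budget
```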

This discovery provides important guidance for the task - adaptive design of future agent systems.

Precise Quantification of the Tool Tax

Finally, the authors intuitively demonstrated the great advantage of implicit enhancement over explicit RAG by comparing the relationship between accuracy improvement and token reduction.

The traditional baseline-plus-RAG scheme improves accuracy but at the cost of heavy computational overhead, appearing in the figure as a shift up and to the right (higher accuracy but more tokens).

Eigen-1, by contrast, sits in the upper-left quadrant: it significantly improves accuracy while cutting token consumption by 53.5% and workflow iterations from 94.8 to 53.4, a 43.7% reduction. This best-of-both-worlds result is the payoff of architectural innovation.

Significance: A New Paradigm for Scientific AI

The significance of Eigen-1 crossing the 60-point mark for the first time goes far beyond a single benchmark result: