
AI defeats human Ph.D. candidates for the first time and instantly turns top-conference papers into code; a 1990s-born researcher from the University of Hong Kong open-sources a project that has amassed over 8,000 stars.

新智元 2025-11-03 13:11
DeepCode, an open-source project from the team led by Huang Chao at the University of Hong Kong, is the first system to outperform machine-learning Ph.D. students from eight top universities, including the University of Cambridge and the University of California, Berkeley, on PaperBench's paper-reproduction-to-code task, and it also outpaces advanced commercial code agents such as Claude Code and Cursor.

In the field of AI, academic papers often carry the latest breakthroughs in algorithms, model architectures, and other aspects.

However, truly understanding the core knowledge of a paper and successfully reproducing the algorithms and experimental results within it often pose significant challenges.

The crux of the problem is the lack of key implementation details.

In reality, paper authors typically compress complex algorithmic logic into a few mathematical formulas, omitting the core details that actually determine success or failure, such as:

The specific value ranges of hyperparameters, the adjustment tricks applied during training, the detailed steps of data preprocessing, network initialization strategies, and so on.

It is precisely these missing implementation details that create a huge gap between theory and practice.

Even senior researchers often find themselves at a loss in such situations.

How to solve this problem?

Recently, DeepCode, an open-source project by Professor Huang Chao's team from the University of Hong Kong, has provided a powerful AI tool to address this challenge.

It can not only analyze paper content and understand algorithmic logic but also automatically generate runnable code.

In benchmark tests, DeepCode has demonstrated outstanding performance in terms of reproduction success rate and code quality, outperforming machine-learning Ph.D. students from top universities on multiple metrics.

Since the release of its first version, v1.0.0, in July this year, DeepCode has attracted wide attention, topped the GitHub Trending list, and received nearly 8,000 stars (as of November 1).

Open-source link: https://github.com/HKUDS/DeepCode

Leading Across Four Categories of Benchmark Comparisons

In benchmark tests, researchers compared DeepCode against the following four categories of baselines:

Human experts;

The most advanced commercial code agents;

Scientific code agents;

Agents based on large models.

The results show that DeepCode achieved the highest scores in every comparison.

Surpassing Human Experts for the First Time: 75.9% vs 72.4%

In the PaperBench benchmark test released by OpenAI, DeepCode achieved an overall accuracy of 75.9%, surpassing the 72.4% score of the human expert group involved in the evaluation.

The specifications of the PaperBench benchmark test are as follows:

Data source: A standardized evaluation benchmark officially released by OpenAI;

Task scale: Complete reproduction of 20 papers from the ICML 2024 conference;

Evaluation dimension: 8,316 independent scorable components;

Scoring mechanism: SimpleJudge hierarchical weighted evaluation system (see the sketch after this list);

Task complexity: End-to-end implementation from paper text to executable code is required.
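PaperBench's rubric can be pictured as a tree of scorable components whose leaf scores are rolled up with weights. Below is a minimal Python sketch of that idea; the class and field names are illustrative assumptions, not OpenAI's actual SimpleJudge implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class RubricNode:
    """One node of a hierarchical rubric tree (illustrative, not the real SimpleJudge schema)."""
    name: str
    weight: float = 1.0                 # relative weight among siblings
    score: Optional[float] = None       # leaf score in [0, 1], graded by a judge model
    children: List["RubricNode"] = field(default_factory=list)

def aggregate(node: RubricNode) -> float:
    """Leaves keep their judged score; internal nodes take the weight-normalized
    average of their children's aggregated scores."""
    if not node.children:
        return node.score or 0.0
    total = sum(child.weight for child in node.children)
    return sum(child.weight * aggregate(child) for child in node.children) / total

# Toy rubric: one paper with two top-level requirements of unequal weight.
paper = RubricNode("paper", children=[
    RubricNode("reproduce main algorithm", weight=2.0, children=[
        RubricNode("loss function implemented", score=1.0),
        RubricNode("training loop matches paper", score=0.5),
    ]),
    RubricNode("reproduce ablation study", weight=1.0, score=0.0),
])

print(f"overall replication score: {aggregate(paper):.3f}")  # -> 0.500
```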

To ensure the scientific rigor of the experiment, the research team also established a high-quality human expert baseline.

First, there were strict qualification standards for human experts.

These experts are all machine-learning Ph.D. students (current or graduated) from 8 top research universities.

The 8 universities are UC Berkeley, Cambridge, CMU, Columbia, Cornell, Purdue, TU Wien, and UMass Amherst.

In addition, the research team adopted a strict human expert screening process:

First, conduct resume pre-screening and academic background verification;

Next, implement a standardized test of machine-learning theoretical knowledge;

Then, evaluate Git version control and software engineering practice abilities;

Finally, verify the candidate's complete skill set in paper reproduction tasks.

The above screening process ensures that all participants have end-to-end capability, from theoretical understanding to code implementation.

The experimental environment configuration is as follows:

Computing resources: Standard configuration of NVIDIA A10 GPU, with some using A100;

Development time: A flexible 4-week development cycle;

Tool permissions: Unlimited use of commercial AI assistants such as ChatGPT and GitHub Copilot;

Attempt mechanism: Each paper gets 3 independent reproduction attempts, scored with the best@3 strategy (sketched below).
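Under best@3, a paper's final score is simply the highest of its three independent attempts. A minimal sketch, with illustrative names:

```python
def best_at_k(attempt_scores):
    """best@k: a paper's final score is the best of its k independent attempts."""
    return max(attempt_scores) if attempt_scores else 0.0

# Three independent reproduction attempts on one paper:
print(best_at_k([0.61, 0.74, 0.68]))  # -> 0.74
```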

The above experimental results demonstrate that:

On complex tasks that demand deep understanding and long development cycles, DeepCode still achieves a higher level of code quality and accuracy, even when human experts can make full use of various AI assistant tools.

This indicates that DeepCode not only reaches but surpasses expert-level code-reproduction ability, an important milestone in the field of autonomous scientific software engineering.

Superior to Existing AI Coding Tools: 84.8% vs 58.7%

On the same benchmark, researchers randomly selected 5 of the 20 papers and conducted a systematic performance comparison between DeepCode and current mainstream commercial code agents.

DeepCode showed a significant leading advantage in the evaluation:

DeepCode scored 84.8%, leading Claude Code (58.7%) by 26.1 percentage points.

To ensure the fairness and authority of the test, all commercial code agents involved in the evaluation were equipped with the most advanced foundation models: Claude 4.5 Sonnet (thinking) and GPT-5 Codex (high).

The results suggest that the performance gap mainly comes from the multi-agent architecture design rather than simply the difference in foundation models.

In addition, DeepCode also maintained its lead in the evaluations of scientific code agents and large-model-based agents:

Compared with the most advanced scientific code reproduction framework, PaperCoder (51.1%), DeepCode achieved a reproduction rate of 73.5%, an increase of 22.4 percentage points.

This significant improvement verifies that the research team's multi-module architecture, which combines planning, hierarchical task decomposition, code generation, and iterative debugging, is superior to simpler pipeline-based methods.

Compared with the best-performing large-model agent (43.3%), DeepCode (73.5%) showed an increase of 30.2 percentage points.

This indicates that for complex code-reproduction tasks, sophisticated agent scaffolding, rather than longer inference time or larger models, is what matters most.

Three Core Capabilities of DeepCode

Paper2Code (Paper → Code)

Input: Academic paper PDF documents;

Output: Production-level code implementation + complete test suite + detailed technical documentation.

DeepCode's core advantage lies in its ability to automatically parse complex mathematical formulas, understand algorithmic logic, and generate high-quality runnable code, which can help researchers quickly reproduce SOTA algorithms, verify theoretical innovations, and accelerate research progress.

Text2Web (Idea → Web Page)

Input: Natural-language descriptions of interface requirements and functional expectations;

Output: Responsive front-end pages + modern UI design + complete interaction logic.

DeepCode can intelligently understand user intentions, automatically adapt to mobile devices, and generate interfaces that meet design specifications, making it suitable for scenarios such as rapid prototype verification, MVP product development, and the implementation of entrepreneurial ideas.

Text2Backend (Requirement → Service)

Input: Descriptions of backend functional requirements and business logic;

Output: High - performance API interfaces + optimized database design + scalable system architecture.

DeepCode can automatically select the best technology stack, consider performance and security, and support cloud-native deployment, making it suitable for scenarios such as rapid development of microservices, refactoring of legacy systems, and enterprise digital transformation.

The Core Technical Framework of DeepCode

DeepCode adopts a systematic three-stage framework, decomposing the complex code-generation task into three steps: architecture blueprint construction, code implementation, and automatic verification, and achieving automatic conversion from documents to executable code through multi-agent collaboration.

Stage One: Architecture Blueprint Construction

This stage transforms lengthy document specifications into structured architecture blueprints, addressing the challenges of long-document understanding through three key steps: hierarchical content segmentation, multi-agent in-depth analysis, and architecture blueprint fusion.

In the multi-agent in-depth analysis phase, two specialized agents, a concept agent and an algorithm agent, analyze different dimensions of the document in parallel, ensuring both a global perspective and specific implementation details.

The code-planning agent integrates the outputs of the two analysis agents, coordinates high-level architecture with low-level specifications, and resolves potential inconsistencies.

Through an intelligent fusion process, a complete architecture blueprint is finally generated, providing detailed guidance for subsequent code generation.
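As a rough illustration of this stage, the sketch below runs two analysis agents in parallel over pre-segmented chunks and then fuses their outputs in a planning step. All function and agent names are hypothetical; they mirror the roles described above rather than DeepCode's actual code.

```python
import asyncio

async def run_agent(role: str, text: str) -> str:
    """Placeholder for an LLM call; a real system would call a model API here."""
    # Hypothetical: returns the agent's analysis of the text for its role.
    return f"[{role}] analysis of: {text[:40]}..."

async def build_blueprint(paper_chunks: list) -> str:
    """Stage One sketch: concept and algorithm agents analyze chunks in parallel,
    then a code-planning step fuses both views into one architecture blueprint."""
    concept_views, algorithm_views = [], []
    for chunk in paper_chunks:  # hierarchical content segmentation assumed done upstream
        concept, algorithm = await asyncio.gather(
            run_agent("concept-agent", chunk),    # global concepts and structure
            run_agent("algorithm-agent", chunk),  # equations, pseudo-code, hyperparameters
        )
        concept_views.append(concept)
        algorithm_views.append(algorithm)
    # Code-planning step: reconcile the two views and resolve inconsistencies.
    fusion_input = "\n".join(concept_views + algorithm_views)
    return await run_agent("code-planning-agent", fusion_input)

blueprint = asyncio.run(build_blueprint(["Section 3: Method ...", "Section 4: Experiments ..."]))
print(blueprint)
```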

Stage Two: Automated Code Construction

This stage conducts a systematic construction of the code repository based on the completed architecture blueprint, addressing the core challenges of maintaining cross-file consistency and the lack of domain knowledge in large-scale codebases through a dual-mechanism design.

Stage Three: Dynamic Verification and Optimization

This stage constructs a multi-level quality-assurance system, achieving comprehensive assurance of code from structural integrity to functional correctness through a dual-verification mechanism of static analysis and dynamic execution, forming a self-improving closed-loop feedback system.
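The static-plus-dynamic verification loop can be sketched as follows. The helper names are assumptions meant only to illustrate the closed feedback cycle, not the project's real interfaces; here the static pass byte-compiles the generated Python files and the dynamic pass runs the test suite with pytest.

```python
import subprocess
import sys

def static_check(repo_dir: str) -> list:
    """Static pass: compile every Python file to catch syntax and structure errors."""
    result = subprocess.run([sys.executable, "-m", "compileall", "-q", repo_dir],
                            capture_output=True, text=True)
    return [result.stderr] if result.returncode != 0 else []

def run_tests(repo_dir: str) -> list:
    """Dynamic pass: execute the generated test suite and collect failures."""
    result = subprocess.run([sys.executable, "-m", "pytest", repo_dir, "-q"],
                            capture_output=True, text=True)
    return [result.stdout + result.stderr] if result.returncode != 0 else []

def repair(repo_dir: str, feedback: list) -> None:
    """Hypothetical placeholder: feed failures back to a code-generation agent for fixes."""
    ...

def verify_and_refine(repo_dir: str, max_rounds: int = 3) -> bool:
    """Self-improving loop: alternate static analysis, dynamic execution, and repair."""
    for _ in range(max_rounds):
        feedback = static_check(repo_dir) + run_tests(repo_dir)
        if not feedback:
            return True  # repository passes both verification layers
        repair(repo_dir, feedback)
    return False
```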

Challenges and Reflections on AI Coding

Currently, AI programming tools perform well in code completion and simple tasks but still fall short in complex tasks that require in-depth understanding.

The reproduction of scientific papers is a typical example - it requires AI to understand mathematical principles, convert abstract concepts into code implementations, and handle various technical details.

DeepCode's progress shows that, with specialized architecture design, AI can achieve good results in specific domains, but general deep-understanding ability still has limitations.

How to enable AI to better understand complex business logic and technical requirements remains an open question.

· From Auxiliary Tools to Development Partners

AI programming tools are evolving from simple code completion to more comprehensive development support.

The complete process demonstrated by DeepCode, from requirement analysis to code generation and then to quality verification, represents this development trend.

However, this also brings new problems:

How to maintain effective developer control over the project while the AI system gains more autonomy?

How to ensure that the generated code meets the team's coding specifications and architecture requirements?

These problems will need to be solved gradually through technological progress and engineering practice.

· Practical Considerations of Vibe Coding

The rise of Vibe Coding has lowered the programming threshold, allowing more people to participate in software development.

However, this model also brings a series of challenges:

How to ensure the quality and consistency of the generated code?

When developers pay less attention to underlying details, how to maintain the long-term maintainability of the code?

How to ensure the security and stability of the code while improving development efficiency?

DeepCode's verification mechanism offers one approach, but more complete engineering practices and quality standards still require further exploration and refinement across the industry.

Author Introduction

Li Zongwei

Li Zongwei (born in 1999) is a Ph.D. student at the University of Hong Kong, supervised by Professor Huang Chao. His research focuses on cutting-edge technologies for large-model agents, and his work was selected for the CIKM 2024 most influential papers list. He is a core contributor to the open-source project DeepCode, which has received approximately 8,000 stars on GitHub.

Li Zhongxing