
Just now, Deli Chen from DeepSeek co-wrote a paper with two AIs.

Synced (机器之心), 2026-05-27 10:50
A 46-page paper, 99% of which was written by an Agent.

“With CodeAgent, I can finally pick up many things that I had to put on hold due to lack of energy in the past. Writing a blog is one of them. Approximately 1% of this blog was written by me, and 99% was written by the Agent.” 😂

Just now, Deli Chen, a researcher at DeepSeek, shared an article on X, “From Copilots to Colleagues: A Survey of Autonomous Research Agents,” which was written largely by an AI Agent.

Article link: https://victorchen96.github.io/auto_research_survey.pdf

Deli Chen also specifically stated that this article is more of an interest-driven attempt. On the one hand, it's for fun, and on the other hand, it's to test his self-developed DeliAutoResearch skill. Therefore, it is not a strictly academic paper, and the views in the article only represent personal opinions and do not represent the positions of any company or organization.

This paper has gone through a total of 6 iterations (V1: 4 iterations, V2: 1 iteration, V3: 1 iteration). The first draft of V1 took 76 minutes, and the total time spent was 6 days. It has gone through approximately 108 rounds of Agent interactions, consuming about 648,000 tokens, with a total of 2,234 lines of LaTeX.

All 103 references have been verified. The paper contains 7 figures and 4 tables. Over the iterations it grew from 45 to 46 pages, and the final file size is 538KB. 😂

After completing this article, Deli Chen put forward an interesting judgment, which he described as a personal extreme view: Code Agents are causing a crazy inflation in computer science papers. In the past, the same work would have taken at least a month.

Deli Chen said that his own “total CPU” time in this process, meaning the time he actually spent thinking, was less than 2 hours.

For a brief introduction, Deli Chen, the first author, is from DeepSeek and is one of the core contributors to the architectures of V1, V2, V3, V4, R1, DeepSeek-Coder, and DeepSeek-MoE. He also represented DeepSeek at the World Internet Conference.

Blog link: https://victorchen96.github.io/

The other two “co-authors” are DeepSeek-V4-Pro and GPT-Image2. The former is responsible for the text, and the latter is responsible for the images.

In essence, Deli Chen used AI to write a review of AI in scientific research. The setup is itself an important experiment: he built an autonomous research agent framework called “Deli AutoResearch SKILL,” and part of this 46-page article was produced with it. The paper also states that the review is published as a “personal research project” and that its views do not represent the position of any company.

The researcher himself has become the object of research. What does this mean? The rest of the paper will gradually clarify it.

The review covers more than 95 papers and systematically analyzes 17 mainstream systems, attempting for the first time to draw a clear map of a chaotic and growing field called “Autonomous Research Agents.” In this field, given a research goal, an AI system independently completes the entire cycle, from hypothesis formulation, experimental design, code execution, and result analysis to paper writing, without human approval at each step.

This is no longer just a concept. In the past 18 months, on SWE-bench, a benchmark for software engineering capability, the share of real GitHub issues that AI can resolve has climbed from less than 5% to over 70%. Some systems can produce complete academic papers at a cost of $15 per paper, and those papers have passed initial human review. Some systems have discovered new mathematical structures beyond the known boundaries without human guidance.

AI is changing from a “research tool” to a “researcher” itself, and the speed is beyond everyone's expectations.

Background: “Copilot” or “Colleague”?

To understand the significance of this change, let's first imagine a traditional research assistant. Given a research topic, he can help you search for literature, organize tables, and execute code. But you need to tell him what to do at each step. When he encounters a problem, he will stop and wait for your instructions. He won't actively think about “what is more valuable to research next.”

This is the role AI has played in the past few years - the copilot. The steering wheel is always in human hands.

What is happening now is an “experiment of power transfer.” A new generation of intelligent agent systems is trying to independently complete the entire scientific research cycle: formulating hypotheses, designing experiments, executing code, analyzing results, writing reports, and even self-reviewing and iterating. From start to finish, there is no need for human approval at each step.

How fast is this transformation? Researchers describe it as “rapid and decisive.” In just 18 months, it has evolved from a tool to a colleague.

However, the meaning of “colleague” also varies greatly. Some systems can only run a piece of code without errors, while others can synthesize compounds independently in a robot laboratory. To establish order in this chaotic landscape, a unified language is needed. This is the core contribution of this review.

Core Contribution 1: Establish a Five-Level Classification for “Autonomy Level”

The most important contribution of this review is to propose a classification system of autonomy levels from L1 to L5, analogous to the SAE standard for automobile driving automation:

L1 (Auto-Completion) is the most common state. GitHub Copilot and various code auto-completion tools fall into this category. AI predicts the next line of code, but you are in control of everything. Productivity is increased by about 30% to 55%, but there is no autonomy at all.

L2 (Task Execution) is the level at which most people interact with ChatGPT and Claude in daily life. AI can decompose tasks and call tools, but it needs your approval at each step. You are the strategy decision-maker, and AI is the executor.

L3 (Multi-Step Autonomy with Checkpoints) is where today's mainstream agentic coding tools sit: Claude Code and Cursor Agent belong to this level. AI can independently execute dozens of operations before reaching a preset checkpoint and only asks you for confirmation when it would go beyond the predetermined scope. Humans maintain strategic supervision but don't need to worry about every detail.

L4 (End-to-End Full Automation) is the current technological frontier. Devin, SWE-Agent, and AI Scientist are all at this level. Given a scientific research goal, it can work independently for hours or even days and produce complete results. You only need to evaluate the results at the end. Among the 17 main systems analyzed in the review, the highest level is L4.

L5 (Autonomous Setting of Research Agenda) is still a “vision” at present. Systems at this level not only execute research but also choose which problems to study, allocate resources, and continuously accumulate knowledge over weeks to months. No existing system has fully achieved L5, but some signs have emerged. Google's Co-Scientist can partially generate hypotheses on its own, and DeepMind's FunSearch has discovered genuinely new mathematical knowledge through iterative program search.

This classification depicts a clear evolutionary path from “helping you work” to “thinking for you,” and it shows what technological gaps lie between the levels.
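
The five levels can be written down as a small lookup table. The sketch below is only an illustration of the taxonomy as this article summarizes it; the enum member names, the `HUMAN_ROLE` wording, and the helper functions are assumptions made for the example and do not come from the paper.

```python
from enum import IntEnum
from typing import Dict


class AutonomyLevel(IntEnum):
    """Autonomy levels L1 to L5; member names are illustrative."""
    L1_AUTO_COMPLETION = 1
    L2_TASK_EXECUTION = 2
    L3_CHECKPOINTED_AUTONOMY = 3
    L4_END_TO_END = 4
    L5_AGENDA_SETTING = 5


# Where the human steps in at each level, paraphrased from the article.
HUMAN_ROLE: Dict[AutonomyLevel, str] = {
    AutonomyLevel.L1_AUTO_COMPLETION: "controls everything; AI predicts the next line",
    AutonomyLevel.L2_TASK_EXECUTION: "approves every step; AI executes",
    AutonomyLevel.L3_CHECKPOINTED_AUTONOMY: "confirms at preset checkpoints",
    AutonomyLevel.L4_END_TO_END: "evaluates the final result only",
    AutonomyLevel.L5_AGENDA_SETTING: "not required for choosing the research problem",
}

# Placements named in the article.
EXAMPLES: Dict[str, AutonomyLevel] = {
    "GitHub Copilot": AutonomyLevel.L1_AUTO_COMPLETION,
    "Claude Code": AutonomyLevel.L3_CHECKPOINTED_AUTONOMY,
    "Cursor Agent": AutonomyLevel.L3_CHECKPOINTED_AUTONOMY,
    "Devin": AutonomyLevel.L4_END_TO_END,
    "SWE-Agent": AutonomyLevel.L4_END_TO_END,
    "AI Scientist": AutonomyLevel.L4_END_TO_END,
}


def human_approves_each_step(level: AutonomyLevel) -> bool:
    """At L1 and L2 the human is in the loop for every single step."""
    return level <= AutonomyLevel.L2_TASK_EXECUTION


def highest_level(systems: Dict[str, AutonomyLevel]) -> AutonomyLevel:
    """Return the most autonomous level found among the given systems."""
    return max(systems.values())
```

Using an ordered enum makes statements such as “the highest level among the analyzed systems is L4” a simple `max` over the table.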

Core Contribution 2: Advantages and Disadvantages of Four Architecture Modes

Knowing “how autonomous the system is” is not enough. We also need to understand “how it achieves it.” The review summarizes the four current mainstream intelligent agent architectures.

Single-Agent Loop is the simplest form: a model repeatedly “plans - acts - observes - reflects.” It's like a researcher working alone, who acts after thinking things through and adjusts after seeing the results. The advantage is that it is simple and controllable. The disadvantage is that it hits a ceiling on complex tasks, just as one person responsible for every kind of work will run out of energy and attention first.
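
To make the “plans - acts - observes - reflects” cycle concrete, here is a toy sketch. It assumes a hypothetical environment that only reports whether a guessed number is too low or too high; no language model is involved, and the function names are invented for the example.

```python
from typing import Callable, List, Optional, Tuple

Observation = str  # "ok", "too_low" or "too_high"


def make_environment(secret: int) -> Callable[[int], Observation]:
    """Hypothetical environment: grades a guess against a hidden number."""
    def environment(guess: int) -> Observation:
        if guess == secret:
            return "ok"
        return "too_low" if guess < secret else "too_high"
    return environment


def run_single_agent_loop(
    environment: Callable[[int], Observation],
    low: int = 0,
    high: int = 100,
    max_steps: int = 20,
) -> Tuple[Optional[int], List[Tuple[int, Observation]]]:
    """Run plan -> act -> observe -> reflect until solved or out of steps."""
    trace: List[Tuple[int, Observation]] = []
    for _ in range(max_steps):
        if low > high:
            break  # nothing left to try
        guess = (low + high) // 2            # plan: pick the next action
        observation = environment(guess)     # act, then observe the result
        trace.append((guess, observation))
        if observation == "ok":
            return guess, trace
        if observation == "too_low":         # reflect: update beliefs
            low = guess + 1
        else:
            high = guess - 1
    return None, trace
```

The `max_steps` budget is the loop's only safeguard here. A real agent would replace the midpoint rule with a model call, and the single loop would then carry all planning, execution, and self-correction on its own.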

Multi-Agent Collaboration is equivalent to forming a team. Different agents play different roles and review and supplement each other. MetaGPT goes further. It encodes the standard operating procedure (SOP) into multi-agent collaboration. Just like a software company, product managers, architects, engineers, and testers each have their own responsibilities and hand over through standardized documents instead of free chatting. As a result, the task completion rate has jumped from 67% to 100%.

Hierarchical Orchestration is the technical implementation of the “manager - executor” mode. A high-level agent decomposes the goal and assigns tasks, and multiple specialized sub-agents are responsible for specific execution and reporting results. Claude Code adopts this architecture. The main agent maintains the global state and high-level planning. When encountering specific tasks such as file editing or web searching, it sends out sub-agents to complete them independently to avoid irrelevant information polluting the main judgment.
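
A minimal sketch of the “manager - executor” idea follows. It is not Claude Code's implementation; the sub-agent functions and the `Orchestrator` class are hypothetical stand-ins. What it shows is the isolation property described above: each sub-agent keeps its own scratch work, and only a short summary flows back into the main agent's context.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class SubAgentResult:
    summary: str        # the only part the main agent keeps
    scratch: List[str]  # intermediate work, discarded after the call


def file_edit_agent(task: str) -> SubAgentResult:
    scratch = [f"open {task}", "read 400 lines", "apply patch"]
    return SubAgentResult(summary=f"edited {task}", scratch=scratch)


def web_search_agent(task: str) -> SubAgentResult:
    scratch = [f"query: {task}", "fetched 10 pages", "ranked results"]
    return SubAgentResult(summary=f"found sources for {task}", scratch=scratch)


SUB_AGENTS: Dict[str, Callable[[str], SubAgentResult]] = {
    "edit": file_edit_agent,
    "search": web_search_agent,
}


@dataclass
class Orchestrator:
    """Main agent: holds the global context and delegates each task."""
    context: List[str] = field(default_factory=list)

    def run(self, plan: List[Tuple[str, str]]) -> List[str]:
        for kind, task in plan:
            result = SUB_AGENTS[kind](task)       # sub-agent starts with a fresh context
            self.context.append(result.summary)   # scratch never enters main context
        return self.context
```

After running a two-step plan, the orchestrator's context holds two lines rather than the eight lines of combined scratch and summaries, which is how this pattern keeps irrelevant detail away from high-level planning.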

Tool-Enhanced Execution is “equipping the agent with external hands and feet”: a code execution environment, web browsing, database queries, laboratory robot control interfaces, and so on. ChemCrow integrates 18 chemistry-specific tools, enabling the model to upgrade from “knowing how to answer chemical questions” to “being able to actually operate chemical processes.” Accuracy on chemistry questions has thus jumped from less than 30% for the original GPT-4 to 75%.
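
In code, “hands and feet” usually means a registry of callable tools plus a dispatcher that executes the structured calls a model emits. The sketch below assumes a made-up call format (`{"tool": ..., "arguments": ...}`) and two toy tools; it is not ChemCrow's toolset or interface.

```python
from typing import Any, Callable, Dict

# Standard atomic masses in g/mol for three elements.
ATOMIC_MASS: Dict[str, float] = {"H": 1.008, "C": 12.011, "O": 15.999}


def molar_mass(composition: Dict[str, int]) -> float:
    """Toy tool: molar mass from element counts, e.g. {"H": 2, "O": 1}."""
    return sum(ATOMIC_MASS[element] * count for element, count in composition.items())


def celsius_to_kelvin(celsius: float) -> float:
    """Toy tool: temperature conversion."""
    return celsius + 273.15


TOOLS: Dict[str, Callable[..., Any]] = {
    "molar_mass": molar_mass,
    "celsius_to_kelvin": celsius_to_kelvin,
}


def execute_tool_call(call: Dict[str, Any]) -> Dict[str, Any]:
    """Dispatch one tool call and return a result the model can read."""
    name = call["tool"]
    if name not in TOOLS:
        return {"ok": False, "error": f"unknown tool: {name}"}
    try:
        return {"ok": True, "result": TOOLS[name](**call["arguments"])}
    except (KeyError, TypeError) as exc:
        # Bad arguments are reported back instead of crashing the agent.
        return {"ok": False, "error": repr(exc)}
```

For example, `execute_tool_call({"tool": "molar_mass", "arguments": {"composition": {"H": 2, "O": 1}}})` computes the value for water instead of asking the model to recall it, which is the shift from “knowing” to “operating” in miniature.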

These four architectures each have their own strengths, and no one of them comprehensively outperforms the others. In reality, the most powerful systems often use them in combination: hierarchical orchestration is responsible for overall planning, tool-enhanced execution is responsible for implementation, multi-agent collaboration is responsible for quality review, and single-agent loop is responsible for specific reasoning.

Core Contribution 3: Six Unsolved Problems

The most honest part of the review is that it confronts the field's still-unsolved problems.

Cognitive Loop Trap: The agent gets stuck in an infinite loop, repeatedly performing the same failed operation without realizing that it is going in circles. AutoGPT has become notorious for this, and entering an infinite loop is its most frequently mentioned defect. Currently, there is no general, systematic solution, and most “anti-loop” mechanisms rely on manual parameter tuning for specific tasks.
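
A typical hand-tuned “anti-loop” guard looks something like the sketch below. It is an assumed, simplified example rather than any specific system's mechanism: it counts how often the exact same action has failed and raises a flag at a threshold. The `max_repeats` value is precisely the kind of per-task parameter the paragraph above refers to.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

Action = Tuple[str, str]  # (tool name, argument), a hypothetical action format


def detect_repeated_failure(
    history: Iterable[Tuple[Action, bool]],
    max_repeats: int = 3,
) -> Optional[Action]:
    """Return the first action that has failed `max_repeats` times, else None."""
    failures: Counter = Counter()
    for action, succeeded in history:
        if succeeded:
            continue
        failures[action] += 1
        if failures[action] >= max_repeats:
            return action
    return None
```

The guard only catches exact repeats. An agent that varies its failing command slightly on each attempt slips through, which hints at why a general solution is still missing.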

Context Window Limit: The “working memory” of the model is limited. A long-running research session may generate more than 100,000 tokens, and early information that falls outside the window is permanently lost. Hierarchical orchestration can alleviate this problem, but it is still difficult to truly achieve “research memory” across sessions.
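
The loss of early information can be seen in a small sketch of the simplest truncation strategy, keeping only the most recent messages that fit a token budget. The whitespace-based `count_tokens` is a crude stand-in for a real tokenizer, and both function names are assumptions for this example.

```python
from typing import List, Tuple


def count_tokens(message: str) -> int:
    """Crude proxy: one token per whitespace-separated word."""
    return len(message.split())


def fit_to_window(messages: List[str], max_tokens: int) -> Tuple[List[str], List[str]]:
    """Keep the newest messages within the budget; return (kept, dropped)."""
    used = 0
    # Walk backwards from the newest message until the budget is exceeded.
    for index in range(len(messages) - 1, -1, -1):
        cost = count_tokens(messages[index])
        if used + cost > max_tokens:
            return messages[index + 1:], messages[:index + 1]
        used += cost
    return list(messages), []
```

With a session such as `["hypothesis: A causes B", "ran experiment one", "result was negative", "try experiment two"]` and a budget of 7 tokens, the original hypothesis is among the dropped messages, so the agent continues experimenting without the statement it set out to test.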

Novelty Assessment: How to judge whether the research results produced by AI are truly novel? Citation prediction is interfered with by social factors, and semantic similarity cannot distinguish between “novel” and “obscure.” Currently