DeepSeek Researchers Let AI Study Itself: AI Writes 99% of a 45-Page Paper in 6 Days

DeepSeek and GPT collaborated to write the first draft in just 76 minutes.

DeepSeek and GPT have collaborated to write a paper!

Zhidx reported on May 27th that last night Deli Chen, a senior researcher at DeepSeek, released a 45-page paper co-written with an agent, of which 99% of the content was written by CodeAgent.

The title of the paper is "From Copilots to Colleagues: A Survey of Autonomous Research Agents", and the authors are Deli Chen, DeepSeek-V4-Pro, and GPT-Image2.

Deli Chen also issued a specific disclaimer: This paper is by no means a rigorous academic paper and does not represent the views of any company or organization. He wrote it purely out of interest and to test DeliAutoResearch, which he built.

He revealed that the paper went through 6 iterations and took 6 days to complete, while the first draft took only 76 minutes. Over that period, the agent ran about 108 rounds in total, consumed about 648,000 tokens, and produced 2,234 lines of LaTeX code. The final product is 45 pages, including 7 figures and 4 tables, with a file size of 538KB. Deli Chen remarked that the same work used to take at least a month, whereas this time his own "CPU running time" was less than 2 hours.
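As a rough back-of-the-envelope check, the reported figures can be turned into per-round and per-page averages. The sketch below uses only the numbers quoted above; the variable names are illustrative, and the averages assume the work was spread evenly, which the article does not state.

```python
# Figures as reported in the article (all approximate).
ROUNDS = 108
TOKENS = 648_000
LATEX_LINES = 2_234
PAGES = 45
ITERATIONS = 6

# Simple averages; an even spread across rounds and pages is an assumption.
tokens_per_round = TOKENS / ROUNDS            # 6000.0
rounds_per_iteration = ROUNDS / ITERATIONS    # 18.0
latex_lines_per_page = LATEX_LINES / PAGES    # roughly 49.6

print(f"tokens per round:     {tokens_per_round:.0f}")
print(f"rounds per iteration: {rounds_per_iteration:.0f}")
print(f"LaTeX lines per page: {latex_lines_per_page:.1f}")
```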

Deli Chen is a core contributor to the architectures of DeepSeek-V1, V2, V3, V4, DeepSeek-R1, DeepSeek-Coder, and DeepSeek-MoE. He holds a bachelor's degree in information management and a master's degree in computer science from Peking University and once served as a WeChat AI researcher at Tencent.

The paper reviews a total of 105 references across three major fields: machine learning, software engineering, and scientific discovery. Deli Chen said that these references have been verified. The paper's core purpose is to provide a unified analysis framework for AI agents capable of conducting independent research, and it makes four main contributions:

1. Propose a five-level autonomy grading system (L1–L5), ranging from code auto-completion to fully independent research planning, which provides standardized terminology for defining and comparing systems.

2. Analyze the four mainstream architecture models: single-agent cycle, multi-agent collaboration, hierarchical orchestration, and tool-enhanced execution; at the same time, build a comparative framework to evaluate the advantages and disadvantages of each architecture in terms of scalability, cost, stability, and human supervision.

3. Analyze 17 mainstream systems based on a six-dimensional feature matrix. The results show that current cutting-edge systems are generally at the L4 level (capable of multi-step independent execution within a limited domain), while L5 remains a target concept.

4. Sort out six core unsolved problems: cognitive dead-loop, context window limitation, innovation value evaluation, result reproducibility, security risk, and usage cost, and provide specific research directions for each problem.

The analysis found that the core bottleneck in achieving L5-level autonomy is not the basic performance of the model but three major difficulties: long-term knowledge accumulation, reliable self-evaluation, and a theoretically grounded way to scale agent architectures.

Many developers have asked for the open-source code in Deli Chen's comment section.

Paper: https://victorchen96.github.io/auto_research_survey.pdf

01. Most current systems are at the L4 level, capable of independently producing papers, and existing systems have shown L5-level features

The paper defines autonomous research agents as a type of software system that, after receiving high-level research goals, can independently execute the iterative closed loop of scientific exploration, including hypothesis generation, experimental design, execution, analysis, and iterative optimization, with little or no human intervention during execution.

The five-level autonomy grading system (L1–L5) for autonomous research agents is based on two dimensions:

One is what content the agent can independently make decisions on, and the other is how long the agent can run autonomously without human review.

The typical representatives of L1 are code completion tools such as GitHub Copilot. At this level, the agent operates at the scale of a single token or a single line of text. Its core ability is to predict the continuation of text written by humans, and humans fully control the direction, structure, and correctness of the content.

The paper mentions that code completion models evolved from Codex can deliver a 30%-55% efficiency improvement in controlled coding tasks but cannot independently complete multi-step goals.

The representatives of L2 are conversational AI assistants such as ChatGPT with plugins and Claude that support tool calls. The agent can break down a clearly defined task into multiple steps and execute them, but each step requires explicit or implicit approval from humans.

Its capabilities include web search, code execution, and information integration, and the whole process requires human guidance in the conversation and verification of intermediate results.

L3 is the code agent. At this level, the agent can independently execute 10-100 consecutive actions and only requests human review at preset checkpoints or when it encounters uncertain situations. It can independently browse code repositories and edit files without step-by-step human approval.

The core difference between L3 and L2 levels is that the agent can make independent decisions, such as choosing which file to edit and how to fix test failures, without obtaining human approval step by step; humans only retain the right of supervision.

The representatives of L4 are systems such as AI Scientist, Devin, and SWE-Agent. They can independently generate research ideas, run experiments, produce complete papers, and even carry out automated peer review, all without human intervention.

After receiving the research goal, agents at this level can run independently for hours to days, including independently recovering from failures, iteratively optimizing strategies, and finally producing complete research results. Humans only need to evaluate the final output results and do not need to supervise the entire execution process.

L5 is the highest level of autonomy. The agent can not only execute research tasks but also independently select research problems, allocate resources among multiple projects, and continuously iterate based on past results.

The research shows that no system has reached this level yet. The agent Voyager, which can independently generate learning courses for tasks of increasing difficulty, and the agent FunSearch, which can iteratively discover new mathematical structures based on past successful programs, have shown some features of L5.
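The five levels can be read as a simple decision ladder. The sketch below is only an illustration of the descriptions above, not the paper's own rubric: the `AgentProfile` fields and the `classify` function are hypothetical names, and reducing each level to a yes/no feature is a simplifying assumption.

```python
from dataclasses import dataclass
from enum import IntEnum


class AutonomyLevel(IntEnum):
    L1 = 1  # code completion: predicts the next token or line
    L2 = 2  # conversational assistant: multi-step, approval at each step
    L3 = 3  # code agent: many consecutive actions between checkpoints
    L4 = 4  # research agent: runs a full research loop for hours to days
    L5 = 5  # selects its own research problems (no system has reached this)


@dataclass(frozen=True)
class AgentProfile:
    # Hypothetical yes/no features paraphrased from the level descriptions.
    executes_multi_step_tasks: bool
    needs_per_step_approval: bool
    completes_full_research_loop: bool
    selects_own_problems: bool


def classify(profile: AgentProfile) -> AutonomyLevel:
    """Return the highest level whose description the profile satisfies."""
    if profile.selects_own_problems:
        return AutonomyLevel.L5
    if profile.completes_full_research_loop:
        return AutonomyLevel.L4
    if profile.executes_multi_step_tasks and not profile.needs_per_step_approval:
        return AutonomyLevel.L3
    if profile.executes_multi_step_tasks:
        return AutonomyLevel.L2
    return AutonomyLevel.L1


completion_tool = AgentProfile(False, True, False, False)
code_agent = AgentProfile(True, False, False, False)
print(classify(completion_tool).name, classify(code_agent).name)
```

Checking the most demanding feature first means a system is always assigned the highest level it qualifies for.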

02. The four major mainstream architectures can be adapted to systems at different levels

The paper summarizes four major mainstream architecture models: single-agent cycle (ReAct/Reflexion), multi-agent collaboration (MetaGPT/AutoGen), hierarchical orchestration (Supervisor-Worker), and tool-enhanced execution (CodeAct).

Single-agent cycle (ReAct/Reflexion): This is the simplest and most widely used basic architecture for autonomous agents. A single language model iterates through a closed loop of "observing the environment → inferring the next action → executing the action → absorbing feedback".

Although the design is simple, it is the core framework of most L3-L4 level systems, and its reasoning strategy leaves a lot of room for optimization and variation, which makes it highly adaptable.
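The observe → infer → execute → absorb cycle can be sketched in a few lines. This is a minimal illustration, not the ReAct or Reflexion implementation: `run_agent_loop` and the toy counter environment are hypothetical, and the `decide` callback stands in for the language model.

```python
from typing import Callable, List, Optional, Tuple

Step = Tuple[int, str, int]  # (observation, action, feedback)


def run_agent_loop(
    observe: Callable[[], int],
    decide: Callable[[int, List[Step]], Optional[str]],
    act: Callable[[str], int],
    max_steps: int = 10,
) -> List[Step]:
    """Run the single-agent cycle until the policy stops or the budget ends."""
    history: List[Step] = []
    for _ in range(max_steps):
        observation = observe()                # observe the environment
        action = decide(observation, history)  # infer the next action
        if action is None:                     # policy considers the goal met
            break
        feedback = act(action)                 # execute the action
        history.append((observation, action, feedback))  # absorb feedback
    return history


# Toy environment (an assumption for illustration): a counter that should reach 3.
state = {"value": 0}


def observe() -> int:
    return state["value"]


def decide(observation: int, history: List[Step]) -> Optional[str]:
    return "increment" if observation < 3 else None


def act(action: str) -> int:
    state["value"] += 1
    return state["value"]


history = run_agent_loop(observe, decide, act)
print(len(history), state["value"])
```

The `max_steps` budget is one simple guard against a loop that never terminates.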

Multi-agent collaboration (MetaGPT/AutoGen): A multi-agent system splits task responsibility among multiple specialized agents, which complete the goal through communication and collaboration.

Hierarchical orchestration (Supervisor-Worker): As task complexity increases, the flat multi-agent communication model gradually becomes ineffective. Hierarchical orchestration introduces a clear supervision and control relationship: a high-level supervisor agent is responsible for breaking down the task, assigning subtasks to specialized executor agents, monitoring progress, and intervening and adjusting when necessary.
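The supervisor's four duties (decompose, assign, monitor, intervene) can be sketched as follows. The `supervise` function, the worker signature, and the retry-then-mark-failed policy are all assumptions made for illustration; the paper may describe intervention quite differently.

```python
from typing import Callable, Dict, List, Tuple

Subtask = Tuple[str, str]                   # (worker kind, payload)
Worker = Callable[[str], Tuple[bool, str]]  # returns (succeeded, output)


def supervise(
    task: str,
    decompose: Callable[[str], List[Subtask]],
    workers: Dict[str, Worker],
    max_attempts: int = 2,
) -> Dict[str, str]:
    """Split a task, dispatch subtasks to workers, and retry failures."""
    results: Dict[str, str] = {}
    for kind, payload in decompose(task):   # break down the task
        worker = workers[kind]              # assign to a specialized worker
        for _ in range(max_attempts):
            ok, output = worker(payload)    # monitor the outcome
            if ok:
                results[payload] = output
                break
        else:                               # intervene: all attempts failed
            results[payload] = "FAILED"
    return results


calls = {"write": 0}


def search_worker(payload: str) -> Tuple[bool, str]:
    return True, f"searched: {payload}"


def flaky_write_worker(payload: str) -> Tuple[bool, str]:
    calls["write"] += 1
    if calls["write"] == 1:                 # fail once to exercise the retry
        return False, ""
    return True, f"wrote: {payload}"


def decompose(task: str) -> List[Subtask]:
    return [("search", f"sources for {task}"), ("write", f"summary of {task}")]


results = supervise(
    "agent survey",
    decompose,
    {"search": search_worker, "write": flaky_write_worker},
)
print(results)
```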

Finally, tool-enhanced execution (CodeAct): The defining feature of autonomous research agents is their ability to interact with external tools and the external environment. Tool-enhanced execution transforms the language model from a passive text generator into a participant in computational and physical workflows. Because it can connect to code, experiments, and web pages, it has the highest ceiling of capabilities.
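One way to picture tool-enhanced execution is an agent whose action is a code snippet that gets run, with the output fed back as the next observation. The sketch below is an assumption-laden illustration, not CodeAct itself: `execute_action` is a hypothetical helper, and plain `exec` provides no sandboxing, so it should never be used on untrusted code.

```python
import contextlib
import io


def execute_action(code: str) -> str:
    """Run a model-emitted snippet and return its printed output or the error."""
    buffer = io.StringIO()
    namespace: dict = {}  # fresh namespace per action; NOT a security boundary
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)
    except Exception as exc:
        # The error text becomes the feedback the agent sees on the next step.
        return f"error: {type(exc).__name__}: {exc}"
    return buffer.getvalue()


print(execute_action("print(2 + 3)"), end="")
print(execute_action("1 / 0"))
```

Returning errors as text rather than raising them lets the surrounding loop treat a failed action as ordinary feedback.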

Generally speaking, L2-level systems can run efficiently with a simple single-agent cycle. L3-level systems benefit the most from Reflexion, which naturally embeds a checkpoint mechanism. L4-level systems usually require a hierarchical orchestration architecture combined with autonomous iterative optimization to maintain output quality during long-term autonomous operation. A theoretical L5-level system will probably require a graph-structured architecture capable of self-reorganization.

03. Three major conclusions: The gap between open-source and closed-source systems is narrowing, domain-specific agents outperform general agents, and code agents are the most mature

Based on a six-dimensional feature matrix, the paper analyzes 17 mainstream systems. The six dimensions are the L1-L5 autonomy level described above, core application field, architecture model, breadth of tool integration, evaluation methodology, and open-source status.

It reaches three major conclusions:

First, systems that focus on a specific field have a higher ceiling of capabilities. Among them, code agents perform best in all dimensions. Benefiting from automated evaluation, a mature tool environment, and large-scale benchmarks, they are the most mature track in the industry today.

Second, domain-specific agents comprehensively outperform general agents. L4-level systems such as SWE-Agent, Coscientist, and FunSearch have achieved stable output by narrowing their application scope. General agents such as AutoGPT and BabyAGI have never been able to achieve stable L4-level operation across diverse tasks.

Finally, the gap between open-source and closed-source systems is narrowing. The performance of the open-source system OpenHands is already very close to that of closed-source systems such as Devin.

In terms of the evaluation system, the paper mentions that three core directions need to be focused on:

Multi-dimensional indicators: Jointly evaluate innovation, correctness, efficiency, and security instead of optimizing a single dimension.

Long-term evaluation: Track the performance of agents in long-term scientific research projects instead of isolated single tasks.

Community-based evaluation: Embed expert feedback loops into the evaluation process and establish industry-consensus evaluation standards.
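To illustrate why joint evaluation differs from optimizing one dimension, the sketch below scores a result by its weakest dimension. The `Scores` fields follow the four dimensions named above, but the 0-to-1 scale and the choice of `min` as the aggregation rule are assumptions for illustration, not the paper's method.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Scores:
    # Hypothetical scores on a 0-to-1 scale, one per dimension.
    innovation: float
    correctness: float
    efficiency: float
    security: float


def joint_score(scores: Scores) -> float:
    """Score by the weakest dimension, so no single strength can mask a gap."""
    return min(asdict(scores).values())


def weakest_dimension(scores: Scores) -> str:
    values = asdict(scores)
    return min(values, key=values.get)


fast_but_unsafe = Scores(innovation=0.9, correctness=0.8, efficiency=0.95, security=0.2)
balanced = Scores(innovation=0.7, correctness=0.7, efficiency=0.7, security=0.7)

print(joint_score(fast_but_unsafe), weakest_dimension(fast_but_unsafe))
print(joint_score(balanced))
```

Under this rule the balanced result outranks the one that is fast but unsafe, even though the latter leads on three of the four dimensions.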

The paper also lists six major core unsolved problems for agent systems: cognitive dead-loop, context window limitation, innovation value evaluation, result reproducibility, security risk, and usage cost.

Among them, the cognitive dead-loop, originality evaluation, and security are the most critical. In a cognitive dead-loop, the agent fails to recognize that it is stuck and keeps persisting with a failed strategy instead of looking for a new method. In addition, there are no reliable automated indicators for the quality and originality of scientific research results, so the agent cannot improve itself within the closed loop. Finally, as agents become more capable, their security boundaries and ethical risks become more prominent.
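A very simple guard against the dead-loop described above is to watch for the same action failing several times in a row and then switch strategy. The sketch below is a hypothetical heuristic, not a technique from the paper: `is_stuck`, `next_strategy`, and the window of 3 are assumptions chosen for illustration.

```python
from typing import List, Optional, Tuple

Attempt = Tuple[str, bool]  # (action, succeeded)


def is_stuck(history: List[Attempt], window: int = 3) -> bool:
    """True if the last `window` attempts repeated one action and all failed."""
    if len(history) < window:
        return False
    recent = history[-window:]
    same_action = len({action for action, _ in recent}) == 1
    all_failed = not any(ok for _, ok in recent)
    return same_action and all_failed


def next_strategy(history: List[Attempt], strategies: List[str]) -> Optional[str]:
    """Keep the current action unless stuck; then pick the first untried one."""
    if not history:
        return strategies[0]
    if not is_stuck(history):
        return history[-1][0]
    tried = {action for action, _ in history}
    for strategy in strategies:
        if strategy not in tried:
            return strategy
    return None  # every strategy has been tried; escalate to a human


history = [("patch_test", False)] * 3
print(is_stuck(history), next_strategy(history, ["patch_test", "read_docs"]))
```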

04. Conclusion: Two AIs collaborate to produce a complete paper, and agents have truly become scientific research colleagues

Deli Chen's experiment this time enabled the agent to independently produce a complete paper from an idea. He only invested 2 hours of human thinking time, and through the collaboration of two AIs, an AI scientific research review paper was produced, which proves the feasibility of AI evolving from a tool to a "scientific research colleague".

Even for complex work with a long cycle and a long process, the final AI-generated paper is logically clear and stays on topic, demonstrating the core capabilities of long-text processing, sustained execution over a long process, and consistent logic throughout.

In the field of scientific research agents, Deli Chen not only demonstrated the capabilities of scientific research agents through an interesting experiment but also
