
A PhD student spent 80 late-night hours rewriting code; Codex finished the job in 2 hours. The research singularity has arrived.

新智元2026-05-13 19:43
In Codex's Goal mode, an 80-hour doctoral-level task took less than 2 hours to complete.

[Introduction] Agentic AI engineers have found that Codex can finish, in under 2 hours, a research task that would take a PhD student roughly 80 hours: a 40-fold efficiency gap! In fact, by older standards AGI already exists; the industry has simply kept moving the goalposts.

The "singularity" in scientific research has truly arrived, far sooner than anyone expected.

Recently, an experiment on Codex's "Goal Mode" shocked the academic community: Codex can increase the efficiency of AI scientific research by 40 times!

Agentic AI engineer Dan McAteer recently disclosed an experiment on X: using OpenAI Codex's Goal Mode to run a research task on Mechanistic Interpretability.

GPT-5.5 estimated that a PhD student would need about 80 hours to complete this task. In practice, the AI finished it in just 1 hour and 56 minutes.

On paper, that is a roughly 40-fold efficiency gain!
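The headline ratio is easy to sanity-check: 80 hours against 1 hour 56 minutes works out to just over 41x, which the article rounds to "about 40 times."

```python
# Sanity-check the headline ratio: 80 hours vs. 1 hour 56 minutes.
human_hours = 80
ai_hours = 1 + 56 / 60            # 1 h 56 min expressed in hours
speedup = human_hours / ai_hours
print(round(speedup, 1))          # roughly 41x, reported as "about 40x"
```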

He used a built-in skill in Codex, /goal.

And the author believes:

/goal + GPT-5.5 (high) + fast mode is the most efficient AI agent configuration available today.

That is, let the model set its own goals. The key is that the prompts it writes are likely to be better than yours.

This is no longer a simple "efficiency improvement" but a complete "dimensionality reduction strike."

When the research cycle shrinks from "weeks" to "hours" and AI starts writing its own experimental goals (/goal), we have to face a harsh reality:

The slope of the "intelligence explosion" has emerged, and the self-iteration speed of AI is getting out of human control!

What exactly is the Codex /goal mode?

Let's first see how this experiment was carried out.

The initiator of the experiment is Dan McAteer, an Agentic AI engineer and a former Amp Code engineer.

He has been sharing the specific practices of AI agent engineering on X for a long time.

His experimental configuration is very simple:

  • Tool: OpenAI Codex /goal command
  • Model: GPT-5.5 high
  • Mode: fast mode
  • Task: a research task in the direction of Mechanistic Interpretability

His description of this configuration is: the most efficient AI agent configuration currently available.

Why is Codex /goal important?

What's really worth talking about is the Codex /goal mode itself.

According to OpenAI Codex engineer Philip Corey, /goal is an implementation of the Ralph loop: the goal persists across multiple rounds of dialogue, and the agent does not stop until the goal is achieved.

Simply put, in a normal Codex call, you say one sentence, it takes one step, and then replies to you.

With Codex /goal, you state a goal, and it breaks down sub-tasks by itself, executes them, reviews them, and continues on its own until the goal is achieved or fails.

This is an engineering switch from conversational AI to goal-driven AI.
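The switch from one-shot calls to a goal-persistent loop can be sketched in a few lines. This is a minimal illustration of the pattern described above, not OpenAI's implementation; `ToyModel` and every method on it are hypothetical stand-ins, not the real Codex API.

```python
from dataclasses import dataclass

# Minimal sketch of a goal-persistent ("Ralph loop") agent.
# ToyModel is a hypothetical stand-in for the LLM, not the Codex API.

@dataclass
class Review:
    goal_achieved: bool

class ToyModel:
    """Pretends the goal needs three sub-steps before it is done."""
    def plan_next_step(self, goal, history):
        return f"sub-task {len(history) + 1} for: {goal}"
    def execute(self, step):
        return f"result of {step}"
    def review(self, goal, result):
        return Review(goal_achieved="sub-task 3" in result)

def run_goal_loop(goal, model, max_iterations=50):
    """Plan, execute, and self-review in a loop; stop only when the goal is met."""
    history = []
    for _ in range(max_iterations):
        step = model.plan_next_step(goal, history)   # break the goal into a sub-task
        result = model.execute(step)                 # run it
        review = model.review(goal, result)          # self-review the outcome
        history.append((step, result, review))
        if review.goal_achieved:                     # persist until achieved
            return history
    raise TimeoutError("goal not achieved within iteration budget")

steps = run_goal_loop("reproduce the interpretability result", ToyModel())
print(len(steps))  # the toy goal completes after 3 sub-steps
```

The key difference from a conversational call is that the loop, not the user, decides when the exchange ends.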

For research tasks like Mechanistic Interpretability, /goal mode is a natural fit.

The research process is itself a loop: propose a hypothesis, design an experiment, run it, inspect the results, revise the hypothesis, and experiment again. That loop maps directly onto a self-cycling agent.

What McAteer's experiment really shows is that Codex /goal mode is viable for research-cycle tasks: not replacing researchers, but replacing researchers' repetitive operations.

If this capability stabilizes, it will exert very direct leverage on AI research itself.

It means that researchers in AI laboratories will be able to hand repetitive work, such as training data preparation, experiment setup, ablation studies, visualization generation, and preliminary result analysis, to AI agents.

This is what Anthropic and OpenAI have been repeatedly saying: AI is accelerating AI research itself.

80 hours for a PhD student vs. 2 hours for AI

In the traditional scientific research context, a doctoral student's daily routine is: reviewing literature, building models, debugging code, verifying results, and writing reports.

This process is long because the human brain has physical limits when dealing with complex logic and massive data.

However, this experiment with Codex completely shattered this perception.

Under the strongest agent configuration of "/goal + GPT-5.5 High + Fast Mode", AI is no longer just a "command-following" tool but an independent researcher that "comes up with strategies."

It can understand the complex requirements of natural language autoencoder (NLA) experiments, break down tasks independently, and complete in less than 2 hours what would take human elites two weeks to accomplish.

This means that the threshold for human scientific research has completely collapsed. The professional analysis ability that used to require years of hard study is now being modularized by algorithms.

Moreover, the autonomous AI researcher has arrived ahead of schedule!

OpenAI previously set the goal of achieving autonomous AI scientific research by the end of 2026.

However, judging by current progress, 2026 may mark not the beginning of that handoff but its completion: the moment humans fully pass the research baton to AI.

Recursive self-improvement is emerging

If the 40-fold speedup with Codex is a striking individual case, what is more worrying is that evidence of "recursive self-improvement" is piling up.

On May 7th, according to Axios, Jack Clark, co-founder of Anthropic, publicly gave a probability:

By the end of 2028, the probability of AI achieving complete recursive self-improvement is over 60%.

The research teams of Sakana AI and UBC created the Darwin Gödel Machine this year, a programming agent that can rewrite its own source code to improve its capabilities.

Paper link: https://arxiv.org/abs/2505.22954

On SWE-bench, its score improved from 20.0% to 50.0% without any human intervention.

The AI Scientist project of the same team was published in Nature in March this year.

It can generate research ideas, write code to run experiments, write complete papers, and conduct peer reviews on its own.

AI independently completes the entire scientific research pipeline from start to finish.

Now for some hard numbers. GPQA Diamond is a science Q&A benchmark written by PhD-level experts. In November 2023, GPT-4 scored 39%; human domain experts average about 65%.

In April 2026, the cutting-edge models collectively crossed the line: Gemini 3.1 Pro scored 94.3%, and Claude Opus 4.7 scored 94.2%.

All cutting-edge models have far outperformed human doctoral experts.

The SWE-bench trajectory illustrates the acceleration even better.

At the end of 2023, Claude 2's pass rate was 2%. Now, it's 93.9%.

In just two and a half years, it soared from 2% to 93.9%.
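The shape of that curve can be made concrete with a back-of-the-envelope calculation. Assuming, purely for illustration, a constant yearly growth multiplier over the two-and-a-half-year span (pass rates saturate near 100%, so this is not a claim about the true dynamics):

```python
# Back-of-the-envelope check on the SWE-bench trajectory quoted above.
# Assumes (for illustration only) a constant yearly growth multiplier.

start, end, years = 2.0, 93.9, 2.5    # pass rates (%) from late 2023 to mid 2026

fold = end / start                    # total improvement factor, ~47x
yearly = fold ** (1 / years)          # implied constant annual multiplier, ~4.7x

print(f"{fold:.1f}x overall, ~{yearly:.1f}x per year")
```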

Anyone who has studied high school mathematics can recognize the shape of this curve.

Obviously, the process of recursive self-improvement (RSI) has begun.

Once AI starts to rewrite its underlying code and optimize its architecture with this 40-fold efficiency, the growth of intelligence will no longer be linear but vertical.

AGI has been delivered, and the entire industry is "gaslighting" you

In fact, as early as February this year, four scholars from different top-tier fields co-authored what may be the most disturbing paper of the year: "Case Study of AGI: Today's LLMs Have Met the Standards."