A Harsh Preview of the Workplace: As AI Learns to Evolve on Its Own, the Survival Space for Entry-Level Talent Is Being Squeezed by "200-Hour Agents"
God Translation Bureau is a compilation team under 36Kr, focusing on technology, business, the workplace, and lifestyle, with an emphasis on introducing new technologies, new ideas, and new trends from abroad.
Editor's note: When AI can complete 200 hours of work almost instantly, it is humans who become the "bottleneck". This METR drill suggests that execution ability will rapidly depreciate, and that the efficiency of human judgment and feedback will become the decisive factor. This article is a compilation.
Introduction
METR aims to keep the public informed about the capabilities of AI and the risks it brings. By some measures, AI is the fastest-developing technology in history, and as AI automates its own research and development (R&D), this process may accelerate further. By the end of next year, the pace of model releases and the number of new evaluations (evals) required may reach the point where keeping up without efficient AI assistance becomes a real challenge. We cannot wait until such workflows become essential before passively exploring AI-augmented ways of working; we need to start understanding them now.
Therefore, we ran a two-hour tabletop exercise: three METR researchers played themselves, with their real current work focuses, but assumed access to an AI that could work autonomously for about 200 hours, which is roughly where we expect the technology to be in 12 to 18 months. Our goal was to understand what workflows would emerge, where the bottlenecks would be, and how much our actual efficiency could improve.
Drill Process
Scenario
Simulated World
METR has an AI with a 200-hour time horizon to automate our work, while the rest of the world uses the real technology of February 2026 (an AI with a time horizon of about 12 hours).
We have versions of Codex/Claude Code suited to a 200-hour time-horizon AI, plus a basic project-management workflow.
But the date is still February 2026, so we are evaluating 2026-era AI, using the 2026 version of Inspect, and communicating with people by email and similar channels.
AI Capabilities
The AI now has a time horizon of about 200 human-hours, but its relative capability profile is similar to that of AI in early 2026.
It performs amazingly well on verifiable tasks and reasonably well on complex, messy ones.
The AI runs at twice the speed of Claude 4.6 Opus in fast mode, and we can afford to run the model at this speed.
For verifiable tasks of average "complexity" comparable to HCAST tasks, a 200 human-hour workload corresponds to a 50% success rate, and a 40 human-hour workload corresponds to an 80% success rate (the sketch after this list interpolates between these two points).
For tasks that are difficult to verify, the game master (GM) decides how well the AI succeeds.
In writing, given the relevant context, the AI performs at the level of an entry-level METR employee.
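Those two success rates pin down a simple curve. Here is a minimal sketch, assuming a logistic-in-log-task-length model of the kind METR has used elsewhere (an assumption; the scenario itself fixes only the two anchor points):

```python
import math

# Interpolating the drill's success-rate anchors:
#   P(success) = 0.5 on a 200 human-hour task
#   P(success) = 0.8 on a 40 human-hour task
# assuming success is logistic in the log of task length (an assumption).

H50 = 200.0  # the 50%-success time horizon, in human-hours

# Fit the slope from the second anchor: logit(0.8) = beta * ln(H50 / 40).
beta = math.log(0.8 / 0.2) / math.log(H50 / 40.0)  # ~0.86

def p_success(task_hours: float) -> float:
    """Success probability on a verifiable, HCAST-complexity task."""
    return 1.0 / (1.0 + (task_hours / H50) ** beta)

print(round(p_success(200), 2))   # 0.5  (anchor)
print(round(p_success(40), 2))    # 0.8  (anchor)
print(round(p_success(1000), 2))  # 0.2  success degrades gradually with length
```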
Gameplay
A manager and two researchers played themselves, with their real current work focuses. I (Thomas Kwa) served as the game master.
Each round represents half a day, with two stand-up meetings per day. Each round takes 15 minutes of real time: 5 minutes for the stand-up and 10 minutes to simulate 5 hours of work. We completed 4 rounds in total (simulating 2 days). [1]
Everyone recorded their play in a shared spreadsheet as it happened, filling in their own and their agents' actions for each hour and consulting the game master when necessary. A screenshot of the spreadsheet is shown below.
Figure 1: Nate Rush frantically sends prompts to a future version of Claude to improve our human-data infrastructure. On day two, he will realize that merely understanding what Joel's and Tom's agents have built is already overwhelming.
Thomas Kwa's Observations
How much did our efficiency improve?
Most participants estimated an efficiency gain of roughly 3 to 5x relative to February 2026 (i.e., completing 1 to 2 weeks of work in these 2 days). I don't want to put too much weight on this number, because it may reflect an optimistic assessment of how much was actually completed, and there were large differences between teams. I find the qualitative conclusions more interesting. Under these assumptions, I noticed that if a model with a time horizon 17 times that of the February 2026 model yields only a 3x efficiency gain, then the relationship between time horizon and speedup is approximately $\text{acceleration ratio} \propto TH^{0.39}$.
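To spell out where the exponent comes from: solving $17^{\alpha} = 3$ for $\alpha$ gives $\alpha = \ln 3 / \ln 17 \approx 1.10 / 2.83 \approx 0.39$.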
What was the actual experience like?
In this 3-person game and the two single-player alpha tests I ran earlier, some common themes emerged:
Thoughts are executed as fast as you have them: once you have an idea, the agent starts implementing it immediately. So instead of mulling an idea over for days, you can have a minimum viable product (MVP) and fixes within a few hours. If the task is not close to the limit of the agent's ability, you spend all your time understanding the results; if the task is challenging, you spend all your time checking its work.
Let the agents work overnight: overnight, agents can complete about 200 human-hours of work, but only on tasks that suit them well. Researchers therefore need to deliberately sequence their projects so that ultra-long, agent-friendly tasks (such as optimizing a well-defined metric) run overnight; the toy scheduler after this list illustrates the idea.
Prioritization and organization become bottlenecks: if agents can execute ideas almost as fast as you can type prompts, it no longer makes sense to implement only your single best idea. It may be better to implement the top three ideas in parallel, but that makes staying organized harder. Even with AI-written dashboards optimized for human understanding, project complexity grows in ways that make project management much more difficult.
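As a toy illustration of the overnight-queue idea (the task list and thresholds are invented for this sketch):

```python
# Toy scheduler: long, well-specified, verifiable tasks go to the
# unattended overnight agent slot; everything else stays in the
# interactive daytime queue. All tasks and numbers are invented.

tasks = [
    {"name": "optimize eval-harness throughput", "hours": 150, "verifiable": True},
    {"name": "draft multi-agent paper intro",    "hours": 6,   "verifiable": False},
    {"name": "hyperparameter sweep",             "hours": 80,  "verifiable": True},
]

OVERNIGHT_BUDGET = 200  # human-hour-equivalents one agent can burn overnight

overnight, daytime = [], []
budget = OVERNIGHT_BUDGET
for task in sorted(tasks, key=lambda t: -t["hours"]):  # longest first
    if task["verifiable"] and task["hours"] <= budget:
        overnight.append(task)          # the agent grinds on it unattended
        budget -= task["hours"]
    else:
        daytime.append(task)            # needs human judgment in the loop

print("overnight:", [t["name"] for t in overnight])
print("daytime:  ", [t["name"] for t in daytime])
```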
Workflows
Based on this drill, I foresee the following trends (of course, predicting the future is extremely difficult):
Declarative workflows: I already did most of my work by writing design documents and letting agents implement them, which keeps both me and the agents in sync. Over the next year, this may evolve into the "write down your local utility function" workflow that Tom Cunningham describes below.
Speculative execution: to avoid serial bottlenecks (see the next section), researchers may use two forms of speculative execution: launching large numbers of long-running experiments for projects whose necessity is still uncertain, and predicting experiment results and feedback in advance (see Tom Cunningham's "Agents can relieve the bottlenecks" section, and the first sketch after this list).
"Proof of correctness": if agents are still not 100% reliable, the most valuable output an agent can produce is a demonstration, to humans, that its code complies with the specification. This may include tests, writing aimed at reproducibility, recording where each line of the design document is implemented, and, in extreme cases, formal verification (second sketch after this list).
Bottlenecks
If execution becomes essentially instantaneous, what remains? Serially time-consuming steps that used to run in parallel with execution can no longer be hidden behind it; they become serial bottlenecks. Most of a project's total duration may then be taken up by steps such as human data collection, machine-learning experiments, and feedback (from peers, managers, and especially external consultants).
Figure 2: We may face nested iteration loops, where the "inner loop" of execution is much faster than the "outer loop", and project progress gets stuck on steps that require irreducible serial time. For tasks agents are already good at, this is already the reality, and it may extend to almost all projects.
I imagine the timeline of a future METR project (such as a paper on the disruptive capabilities of multi-agent systems) would look like the table below (text description in footnote [2]). It might take six weeks of calendar time with only about 8 hours of agent workload (not counting time spent running evaluations), meaning the ratio of bottleneck time to agent workload far exceeds 100:1.
Figure 3: A future project might take about 42 calendar days, including about 8 hours of agent workload (excluding evaluation runs) and 1,000 hours of serial time for human IC work, evaluation runs, and review. In reality, humans would adapt to the new constraints, so project timelines would not look exactly like this.
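As a quick sanity check on that ratio, using the caption's figures: 42 days × 24 h/day ≈ 1,008 hours, consistent with the roughly 1,000 hours of serial time, and 1,000 h of bottleneck time against 8 h of agent work gives a ratio of about 125:1, matching the "far exceeds 100:1" claim.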
People may run multiple projects in parallel, with agents briefing them on the status of each. Once the number of projects grows large enough that task-switching costs bite, individual contributors may instead put their marginal effort into slightly improving the quality of each project.
Some organizations will face enormous competitive pressure to streamline their review processes and increase the serial speed of their experiments.
Subsequent Iterations
Everyone enjoyed the drill: two participants rated it 9/10, and one even gave it an "11/10". I hope this becomes a regular exercise at METR, held once a month for example, rotating among the propensity team, the capabilities team, the operations team, and the whole company.
If we run it again, I will try some other variations:
A version with a 50-hour time horizon, to guide METR's operations over the next quarter. This one would need to run soon to stay current.
A version where we imagine having infrastructure that fully exploits a 200-hour-TH AI. This demands more imagination from everyone.
A version focused on AI R&D research. Understanding the bottlenecks as the work approaches automation, and roughly estimating future efficiency gains, could inform timelines and takeoff models.
A version that better simulates researchers' output across multiple parallel projects. The current version allows task switching in hourly increments; capturing switches every few minutes would require higher resolution.
Tom Cunningham's Observations
We spent 2 hours on Thomas Kwa's drill: assume we have a very powerful AI (a 200-hour time horizon), but everything else stays the same: our job is still to study the capabilities of the February 2026 models, and everyone else in the world is still using February 2026 technology.
My time was spent on: (1) writing down the goals I wanted to achieve; (2) providing feedback on the output.
I thought about whether I would still want to do data analysis and write reports, and how I would use the powerful AI to do so. The workflow I envisioned (sketched in code below): (1) write down my overall goals; (2) the agent drafts output from those goals; (3) I give feedback on the output; (4) return to step 2 with updated goals.
An example goal: "Give me an optimized benchmark-comparison table. The columns should cover what matters for choosing a third-party risk-assessment benchmark. I want to be able to tell which claims are certain and which are speculative. Make it self-verifying, for example by showing checkmarks or crosses based on an independent agent's audit of each claim."
I have already used agents for similar things, but here I would expect reliability to improve by several levels. Instead of saying "I want this chart to be clickable", I would say "I want this report to be readable, comprehensive, quantitative, and verifiable".
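A minimal sketch of that goal-draft-feedback loop, with hypothetical stand-ins for the agent call and the human review:

```python
def agent_draft(goals: str) -> str:
    # Placeholder: in reality, send the goals to the agent and get a draft back.
    return f"<draft addressing: {goals.splitlines()[-1]}>"

def collect_feedback(draft: str) -> str:
    # Placeholder: in reality, a human reads the draft and responds.
    return "Label which claims are certain and which are speculative."

# Step 1: write down the overall goals.
goals = "Readable, comprehensive, quantitative, verifiable benchmark table."
for _ in range(3):                      # iterate until the goals stabilize
    draft = agent_draft(goals)          # step 2: agent drafts the output
    feedback = collect_feedback(draft)  # step 3: human feedback on the draft
    goals += "\n" + feedback            # step 4: fold the feedback into the goals
```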
We will be bottlenecked on human feedback.
Thinking it through, I quickly ran into other bottlenecks: (1) kicking off new runs; (2) getting feedback from other people.
Agents can relieve the bottlenecks.
Once you can automate most of the work with agents, it seems you will hit bottlenecks in the non-automated parts. But in fact, the non-automated parts are usually predictable, and that predictability relieves the bottleneck.
Imagine each report containing the following:
The agent's best prediction of the comments that Beth, Hjalmar, and Ajeya would make.
The agent's best prediction of the survey results (if you were to run a survey).
The agent's best prediction of the benchmark results.
The agent's best prediction of the reaction on Twitter.
In addition, you can click through to see the agent's reasoning behind each prediction. I think these would significantly relieve the bottlenecks: I could keep iterating until the information coming back from the outside world (human feedback, data, surveys) carries the maximum amount of information, and only then send the report out for review.
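A sketch of how such predictions might be attached to a report; the structure and field names are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    source: str     # e.g. "Beth", "survey", "benchmark", "Twitter"
    content: str    # the agent's best-guess outcome
    rationale: str  # shown when the reader clicks through

@dataclass
class Report:
    body: str
    predictions: list[Prediction]

report = Report(
    body="Draft analysis of third-party risk-assessment benchmarks.",
    predictions=[
        Prediction("Beth", "will ask for effect sizes", "pattern in past reviews"),
        Prediction("benchmark", "run B beats run A by ~5%", "extrapolated scaling trend"),
    ],
)

# Iterate on the draft until outside feedback would be maximally informative,
# then send it out for real review.
for p in report.predictions:
    print(f"{p.source}: {p.content} (why: {p.rationale})")
```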
I felt like a principal investigator (PI).
I thought of two analogies: a PI in a research laboratory or a partner at McKinsey.
Both spend their time reviewing others' output, providing advice, and waiting for the next round of review.
This setup is very efficient, but it has pathological failure modes. I suspect many PIs don't have time to understand detailed statistical or conceptual arguments, so doctoral students and postdocs have no incentive to check those arguments, and the lab may end up producing superficial papers.
However, for