Software is eating the world, and this time it's for real.
God Translation Bureau is a compilation team under 36Kr, focusing on fields such as technology, business, the workplace, and life, and mainly introducing new technologies, new ideas, and new trends from abroad.
Editor's note: Stop treating AI as just a chat tool. Agents are taking over real-world jobs and burning tokens at a furious pace, and a trillion-level "inference tsunami" has quietly been set off. This article is a compilation.
In 2011, software ate the world. At least that's what Marc Andreessen told us. But if that's true, why does the San Francisco Bay Area still exist? If software really ate everything, shouldn't we all have moved to New York or Miami by now?
Actually, let's see what software has really eaten: Banks have apps, the retail industry has websites, hospitals have introduced Electronic Health Record (EHR) systems, and taxis can be dispatched with just a few taps on the screen, eliminating the need to call a cab at 2 a.m. when you can't even remember where you are.
Software has eaten the interfaces, but what about the actual work? Most of it is still done by humans.
When a customer calls with a billing dispute, the software routes the call, pops up the account screen, and records the outcome after the call. But throughout the process, it is still humans who listen, judge whether the refund policy applies, decide how to handle it, and actually talk to the customer. When a loan officer reviews a loan application, the software shows the credit score and pulls up the documents on the screen, but it is the loan officer who reads those documents and makes the final decision. For the past 15 years, software has been very good at playing the role of a "pipeline," while the core work has still been done by humans.
Now, AI can really do these jobs! A customer service call is turning into an agent loop: the system handles speech recognition, queries the account through an API, retrieves the relevant policies, reasons about whether the customer is eligible, triggers a refund, and responds using Text-to-Speech (TTS). An insurance claim is evolving into document intake followed by policy limit verification, fraud flagging, reserve calculation, and settlement, all running automatically as code. A programming task can now involve 30 rounds of file reading, code modification, test runs, and optimization, all without human intervention.
Each of these workflows is essentially a piece of software performing tool calls in a loop. If you're an inference service provider looking at logs, you'll find that a customer service agent handling a billing dispute looks no different from a programming agent fixing a bug. They're all code.
So, software is eating the world again, and this time, inference is really eating jobs. The workloads being eaten are essentially just "state transitions" and "exception handling" in human guise: customer service calls, insurance claims, loan approvals, medical administration, and legal analysis. Each task burns thousands of tokens in dozens of steps, and often multiple models run simultaneously. The entire inference market processes tens of trillions of tokens every day, and it's growing exponentially: There are more users, more workflows are turning into code, and as model capabilities improve, the number of tokens consumed per task is also skyrocketing.
Which workloads will be eaten
When a job is essentially just state transition plus exception handling, it will be absorbed by code. People doing this kind of work may have titles like "claims adjuster," "loan approver," or "revenue cycle specialist," but if you observe what they do all day, it's basically: check the input, compare it against a set of rules, decide which category it belongs to, perform an operation, handle odd exceptions, and then move on to the next. If the input can be captured as text, voice, or a document, the intermediate state is stored in a database, and the output is something like "update this record," "send this message," or "trigger this API call," then the entire process can and will definitely run as an agent loop.
A key condition that determines how deep this loop can go is verification. In programming, an agent can loop autonomously for 30 steps or more because verification is instant and free. A test either passes or fails.
But in fields such as drug development, real verification requires wet-lab experiments that take weeks or months, and in robotics the "sim-to-real" gap is still a bottleneck. In both cases the agent's loop hits a wall. Over time, these fields will consume more inference, but there is a ceiling on how long the loop can run because it has to wait for the physical world to catch up.
When a customer service agent resolves a ticket, the verification criteria are "Did the API call succeed? Was the record updated?" or even "Was the user satisfied with the answer?" When a loan approval agent processes an application, the verification criteria are "Does this document meet the requirements? Did the compliance check pass?"
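The way verification bounds the loop can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's agent framework: `agent_loop`, `propose_fix`, `run_tests`, and the 30-step budget are hypothetical stand-ins, and the "model" here is a stub that happens to succeed on its third attempt.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class LoopResult:
    succeeded: bool
    steps_used: int
    answer: Optional[str]


def agent_loop(
    propose: Callable[[int, Optional[str]], str],
    verify: Callable[[str], bool],
    max_steps: int = 30,
) -> LoopResult:
    """Propose, verify, and retry until verification passes or the budget runs out."""
    feedback: Optional[str] = None
    for step in range(1, max_steps + 1):
        candidate = propose(step, feedback)
        if verify(candidate):
            return LoopResult(True, step, candidate)
        # Feed the failure back so the next proposal can take it into account.
        feedback = f"attempt {step} failed verification"
    return LoopResult(False, max_steps, None)


# Hypothetical stand-in for a model call: returns a new patch name each step.
def propose_fix(step: int, feedback: Optional[str]) -> str:
    return f"patch-v{step}"


# Hypothetical stand-in for a test suite: only the third patch passes.
def run_tests(candidate: str) -> bool:
    return candidate == "patch-v3"


result = agent_loop(propose_fix, run_tests)
print(result)
```

When `verify` is cheap and instant, as with a test suite or an API status check, the loop can afford many steps. When it means waiting on a wet lab, every iteration is gated by the physical world.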
I think most people greatly underestimate how much inference these transformed workflows actually consume, because they still imagine one model, one call, one answer, maybe with some hallucinations in between. The reality is quite different.
Take a voice customer service agent handling a seemingly simple but practical task, such as rescheduling a doctor's appointment. To the customer, it feels like an ordinary conversation. Under the hood, it is a small autonomous system running continuously. When the caller speaks, a speech recognition model transcribes the audio in real time. An orchestration model then reasons over the transcript, retrieves the patient's file, checks time constraints, queries the doctor's availability, decides what to ask next, and calls the relevant tools. Once it has enough information, it assembles the results into a response, which a Text-to-Speech model converts into natural speech. Meanwhile, other models may be monitoring sentiment, running compliance checks, or deciding whether to transfer the call to a human representative.
The system does all the work on its own: listening, retrieving, deciding, calling tools, verifying, and responding in a loop. An 8-minute call may contain only about 3,000 tokens of raw transcript, but once you factor in repeated inference over the ever-growing conversation, retrieved context, and tool outputs, plus continuous ASR (Automatic Speech Recognition) and TTS (Text-to-Speech) inference throughout the call, the orchestration layer can easily consume about 40,000 tokens. A so-called "AI phone call" is actually a continuously running multi-model inference stack.
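A rough accounting shows how about 3,000 transcript tokens can turn into about 40,000 processed tokens. Only those two figures come from the text. The turn count, the tokens added per turn, and the per-turn overhead below are assumptions chosen for illustration, and ASR and TTS inference are left out.

```python
def orchestration_tokens(turns: int, new_tokens_per_turn: int, overhead_per_turn: int) -> int:
    """Total tokens processed when every turn re-reads the whole conversation so far."""
    total = 0
    transcript = 0
    for _ in range(turns):
        transcript += new_tokens_per_turn          # the conversation grows each turn
        total += transcript + overhead_per_turn    # the model re-reads all of it, plus overhead
    return total


# Assumed values, not from the article: 20 turns of 150 tokens each (3,000 total),
# plus 400 tokens per turn for instructions, retrieved context, and tool output.
TURNS = 20
NEW_TOKENS_PER_TURN = 150
OVERHEAD_PER_TURN = 400

raw_transcript = TURNS * NEW_TOKENS_PER_TURN
processed = orchestration_tokens(TURNS, NEW_TOKENS_PER_TURN, OVERHEAD_PER_TURN)
print(f"raw transcript: {raw_transcript} tokens, processed: {processed} tokens")
```

Under these assumptions the orchestration model processes 39,500 tokens for a 3,000-token conversation, most of it from replaying earlier turns.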
Which fields are on the rise
The categories above are already deployed in production, and large parts of their workflows have been replaced by code. But there is a second tier of markets where the same dynamics are starting to emerge, just at an earlier point on the development curve.
The legal field is a good example. The first wave of legal AI mostly stayed at the search level: finding relevant cases, flagging risky clauses, and summarizing contracts. That is useful, but relatively shallow from an inference perspective. The emerging agent-based work is much more substantial. Imagine a Mergers and Acquisitions (M&A) due-diligence agent that reads everything in a data room, cross-references purchase agreements against disclosure schedules and due-diligence materials, identifies inconsistencies, writes risk memos with citations, and proposes revisions. This new type of legal agent is a long-running workflow over a massive corpus, and it produces real work product that would otherwise take a junior lawyer days to assemble. The shift is from "helping humans find information" to "doing the analysis itself," which pushes the task from a few lightweight retrieval calls to a deep, multi-step loop.
Finance, accounting, supply chain, government affairs processing, and procurement all have similar forms: a large number of documents, a large amount of exception handling, a large number of intermediate decisions, and more complex verification than people imagine. These are the categories most worth paying close attention to in the next few years because they're at the critical point of model capabilities. As task horizons continue to expand, more and more of this middle-ground area will be eaten by code.
The Token Ladder
Open Claude Code on a real source code repository and ask it to fix a bug, for example: "There's a race condition in the authentication process that only occurs under high load." Before it can do anything truly useful, the agent needs to read: relevant source files, test cases, configurations, and maybe some logs. This can easily generate about 60,000 tokens of context right at the start.
Then it enters a loop. It reads the failed test, checks the authentication module, makes a hypothesis, modifies the lock logic, runs the test on the new code, gets a new error (that's progress!), revises the hypothesis, and tries again. Sometimes, the fix even breaks something upstream, and the agent has to re-read a dependency it touched before. It repeats this process over and over: read, modify, run, check, revise. After 30 iterations, the test finally passes, the code diff is cleaned up, and the entire test suite runs one last time.
This takes 3 minutes in real time, and you may have burned about 900,000 tokens.
Surprisingly, the visible output is minimal. There may be only about 500 tokens of actual code and a short explanation. The remaining roughly 899,500 tokens are all consumed by the loop itself: replaying the accumulated context, taking in the latest tool output, reasoning about what to try next, and carrying forward all the history needed to stay coherent. The answer is concise, but the work is expensive.
Compare this with how the same model answered a direct question before the agent era, like: "What's the difference between optimistic locking and pessimistic locking?" This would only take about 900 tokens in total.
The agent loop can increase the inference demand by about three orders of magnitude. I see it as a "Token Ladder," and every task in the economy is at a certain position on this ladder.
At the bottom is ordinary conversation: one question, one answer, no tools, about 900 tokens.
One step up is retrieval: the model searches through several documents, reads them, and writes a synthesized answer. Now it consumes nearly 7,500 tokens. Most of the overhead is not in the answer itself, but in the model reading and replaying the retrieved context.
Next up is customer service. A basic FAQ robot can be relatively lightweight. A more agent-based customer service system that can check your account, retrieve relevant policies, reason about eligibility, and actually perform operations behind the scenes is much heavier. The surface-level interaction may look the same, but the inference profile is completely different.
Then comes programming, near the top of the ladder. A bounded bug fix can run to hundreds of thousands of tokens. A real debugging or feature-development session can approach a million tokens. Anthropic's Claude Code documentation states that an active developer can burn about $13 on inference per day, which is equivalent to about 1.5 million to 3 million tokens per day, depending on the model mix and caching.
The underlying pattern is simple: Tokens per task = Initial context + (Number of steps × Tokens per step)
At the very bottom of this ladder, there's only one step, almost no context, and no tools. At the top, there are dozens of steps. Each step replays the ever-increasing historical records, introduces the latest tool output, and infers from it, and finally spits out only a tiny amount of visible work results. That's why there's such a big gap between the rungs of the ladder.
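The formula can be written as a small function and checked against the two ends of the ladder. The 900-token chat, the 60,000-token initial context, the 30 steps, and the 900,000-token total come from the text. The average of 28,000 tokens per step is not stated in the article; it is back-solved here so that the bug-fix example adds up.

```python
def tokens_per_task(initial_context: int, steps: int, tokens_per_step: int) -> int:
    """Tokens per task = initial context + (number of steps x tokens per step)."""
    return initial_context + steps * tokens_per_step


# Bottom rung: one question, one answer, no tools, about 900 tokens.
chat = tokens_per_task(initial_context=0, steps=1, tokens_per_step=900)

# Top rung: the bug-fix session. 28,000 tokens per step is an assumed average,
# derived as (900,000 - 60,000) / 30.
bug_fix = tokens_per_task(initial_context=60_000, steps=30, tokens_per_step=28_000)

ratio = bug_fix / chat
print(f"chat: {chat}, bug fix: {bug_fix}, ratio: {ratio:.0f}x")
```

The ratio works out to 1,000, which is the "three orders of magnitude" gap between a plain answer and an agent loop.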
Why the ladder is still climbing
METR (the maker of the most important chart in the world) has been measuring how long cutting-edge models can handle multi-step tasks autonomously. They calibrate their benchmark tests against the time it takes human experts to complete the same tasks, so the results are in units of time: a model "can handle a 30-minute task" means it can reliably complete a task that would take a skilled human about 30 minutes.
The pace of this curve is absurd. GPT-4 could handle tasks of about 4 minutes. Claude 3.5 Sonnet reached about 11 minutes. Claude 3.7 Sonnet extended this to about an hour. o3 reached about 2 hours. GPT-5 landed at around 3.5 hours. And the latest cutting-edge models (such as Claude Opus 4.6) are approaching 12 hours. That is an approximately 180-fold increase in the time horizon of autonomous tasks in just two years. According to METR's measurements, this horizon has been doubling about every 131 days since 2023.
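The growth figures can be related to one another with a quick back-of-the-envelope calculation. The inputs are the numbers quoted above (about 4 minutes, about 12 hours, a 131-day doubling period); treating the curve as a clean exponential between those two endpoints is a simplifying assumption.

```python
import math

start_minutes = 4         # GPT-4 horizon quoted in the text
end_minutes = 12 * 60     # roughly 12 hours for the latest models
doubling_days = 131       # doubling period quoted in the text

growth = end_minutes / start_minutes        # fold increase in task horizon
doublings = math.log2(growth)               # how many doublings that represents
implied_days = doublings * doubling_days    # time needed at a constant doubling period

print(f"{growth:.0f}x is {doublings:.1f} doublings, about {implied_days:.0f} days at 131 days each")
```

A 180-fold increase is about seven and a half doublings. The implied time span is sensitive to which models are taken as the endpoints and to the fitted doubling period.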
Why is this crucial for inference demand? Because a longer task horizon means not only that "the model has become smarter," but also that the model can stay in the loop longer. A model that can only handle a 4-minute task may just read a little context, take a few actions, and then stop. A model that can handle a 4-hour task can read, call tools, check outputs, revise plans, and keep going until it completes a substantial task. Each additional loop means more context replay, more tool outputs, more intermediate state, and more inference. Therefore, as capabilities improve, the number of tokens consumed per task also rises (usually super-linearly).
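One way to see where the super-linear growth comes from is a simplified model that is not stated in the article: suppose the initial context is C_0 tokens, each step appends d tokens of new tool output and reasoning, and step k re-reads everything accumulated so far. The total over n steps is then:

```latex
T(n) = \sum_{k=1}^{n} \left( C_0 + k\,d \right) = n\,C_0 + d\,\frac{n(n+1)}{2}
```

The second term grows with the square of n, so under this model doubling the number of steps roughly quadruples the replay cost once n is large.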
You can see this in specific workloads. In customer service, a basic FAQ robot in 2023 might consume only about 3,500 tokens per ticket. Better retrieval pushed this number up, and then tool use and reasoning increased it further. Now, the full voice-support stack consumes even more. Programming follows the same pattern, but more dramatically: in the past, a bounded programming task required only tens of thousands of tokens, but now that agents are capable enough to handle real debugging, refactoring, and multi-file work, the number has become hundreds of thousands or even well over a million. Every valuable task can now justify much more inference spend than a year or two ago because the models can actually get the job done.
This is a subtle version of the Jevons paradox. For cutting-edge models, the price per token is actually rising, not falling. But the value delivered by every million tokens is growing much faster: today's cutting-edge models can complete a whole workflow in one coherent session, while a year ago it might have taken dozens of flawed attempts or might not have been possible at all. Even though the nominal token cost is rising, the effective cost of getting a useful result is falling. This dynamic has opened up new blue-ocean areas: complex insurance claims, extensive code refactoring, long-running research tasks, and multi-step logistics and administrative processes. Two years ago, these were not even a meaningful part of the inference market because the models of the time could not stay coherent long enough to complete them.
The aggregate data shows that all this is already happening. As of April 2026, OpenAI's API processes over 15 billion tokens per minute, up from 6 billion half a year earlier. Google went from 9.7 trillion tokens per month to 480 trillion in a year, a roughly 50-fold increase. OpenAI revealed that inference token consumption per enterprise organization has increased 320-fold year on year. Anthropic's latest reported annualized revenue has reached $30 billion (it was only $10 billion at the beginning of this year...), which speaks for itself, especially considering that its core growth drivers are Claude Code and its API.
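These figures are easier to compare once converted to common units. The sketch below uses only the numbers quoted above. Squaring the half-year growth to get an annual rate assumes the same pace held for a full year, which the article does not claim.

```python
MINUTES_PER_DAY = 24 * 60

# OpenAI API: tokens per minute, now and half a year earlier (from the text).
openai_now = 15_000_000_000
openai_before = 6_000_000_000

openai_tokens_per_day = openai_now * MINUTES_PER_DAY
half_year_growth = openai_now / openai_before
annualized_growth = half_year_growth ** 2   # assumption: the pace holds for a full year

# Google: monthly tokens a year apart (from the text).
google_growth = 480e12 / 9.7e12

print(f"OpenAI API per day: {openai_tokens_per_day / 1e12:.1f} trillion tokens")
print(f"OpenAI growth: {half_year_growth:.1f}x per half year, {annualized_growth:.2f}x annualized")
print(f"Google growth: {google_growth:.1f}x in a year")
```

At 15 billion tokens per minute, OpenAI's API alone works out to about 21.6 trillion tokens per day.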
This explosion is the result of the superposition and compounding of three curves: more users, more tasks routed to the model by each user, and as the models can handle longer and deeper workflows, more tokens are consumed per task. If the number of users triples, the number of tasks per user also increases, and on top of that, the token consumption per task rises again, then the overall demand will grow much faster than any of these curves in isolation. Google's approximately 50-fold year-on-year token growth is a real-world manifestation of this compound effect.
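Because the three curves multiply rather than add, modest growth in each compounds quickly. In the sketch below, the tripling of users comes from the text. The 2x and 4x multipliers are made-up values for illustration only, and the article gives no breakdown of Google's growth.

```python
def compound_growth(users: float, tasks_per_user: float, tokens_per_task: float) -> float:
    """Overall token demand growth when the three factors multiply."""
    return users * tasks_per_user * tokens_per_task


# Users triple (from the text); tasks per user and tokens per task are hypothetical.
example = compound_growth(users=3.0, tasks_per_user=2.0, tokens_per_task=4.0)

# If users tripled, this is what the other two factors would have to contribute
# together to reach a 50-fold total.
remaining_for_50x = 50 / 3.0

print(f"example total: {example:.0f}x, remaining factor for 50x: {remaining_for_50x:.1f}x")
```

With these illustrative numbers, no single factor exceeds 4x, yet total demand grows 24x.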