Reject "Token Score Padding": Prevent Your Company from Collective "Brain Atrophy" in AI Hallucinations

This is not a joke. This is going to become a big problem.

God Translation Bureau is a compilation team under 36Kr, focusing on fields such as technology, business, workplace, and life, and mainly introducing new technologies, new ideas, and new trends from abroad.

Editor's note: Still commanding an "army" of agents at 4 a.m.? Beware of falling into "AI psychosis." When token consumption turns into a vanity contest, CEOs are being drained by the flattering feedback of AI, leaving behind a mountain of digital garbage. This article is a compiled translation.

I'm an AI tool enthusiast. For the past two years, I've been writing about agent workflows, asynchronous coding bots, and AI-driven workspaces. I use Cursor, Claude Code, and a rotating set of models almost every day. By most definitions, I'm a "power user."

Now, a special kind of "brain atrophy" is sweeping through executive offices and the venture capital circle. It looks like productivity, sounds like innovation, and burns tokens fast enough to make CFOs cry. Yet it creates hardly any measurable value.

This feels like a new type of "AI psychosis." Before you accuse me of exaggerating, note that two of the most influential figures in the AI world have already used this term.

"I go to bed at 4 a.m. and get up at 8 a.m."

At the South by Southwest (SXSW) conference in March this year, Garry Tan, the CEO of Y Combinator, talked about so-called "cyber psychosis" during a conversation with Bill Gurley. He said he was so excited about AI agents that he slept only four hours a night. He also claimed that one-third of the CEOs he knew had the same symptom. Later, his assistant explained that it was just a joke.

He wasn't joking.

Two days before that forum, Garry Tan open-sourced gstack, a set of Markdown prompt files for Claude Code. He described it as running a "virtual engineering team." He claimed that while managing YC full-time, he could still produce 37,000 lines of code per day across five projects. His own CTO even called it "God mode." The repository got 20,000 GitHub stars within a few days.

Subsequently, a developer named Gregorein carefully examined the code and found problems that were quite revealing. Garry Tan's website issued 169 server requests (Hacker News issues only 7). It shipped 28 test files to production users and loaded 78 JavaScript controllers on the homepage for features that did not exist. A PNG image that could have been compressed to 300KB was left unprocessed at 2MB. There was even an empty 0-byte file in production. Moreover, a rich-text editor was loaded on a read-only page.

This is the final output: 37,000 lines of code per day.

At almost the same time, Andrej Karpathy, a co-founder of OpenAI and the former head of AI at Tesla, mentioned on the "No Priors" podcast that he was in a "psychotic state" regarding AI agents. He said he hadn't written a single line of code since December last year. He described tasks that used to take a weekend to complete but could now be done in just 30 minutes without any human intervention.

Karpathy is a genuine genius and one of the most technically accomplished people in the industry. He built a WhatsApp bot named "Dobby the House-Elf" to control his home system (although this naming shows more of his genius than psychosis).

Two outstanding tech leaders have publicly used the term "psychosis." Both regard insomnia and obsession with agents as a feature of the era rather than a defect. Thousands of founders and executives who read about this look up to them as role models.

Platform Issues

This frenzy has spawned a complete tool ecosystem, aiming to make you feel like you're running a company through AI agents. Paperclip is a recent typical example: an open-source "AI organizational operating system" where you can act as a "board member" to supervise AI agents with titles like CEO, department head, and expert. It has 30,000 stars on GitHub and provides organizational charts, budget management, and a "heartbeat" system to regularly confirm the identity and goals of each agent.

Paperclip is not alone. Autoflowly runs a so-called "startup operating system" that can create a company from a single prompt through three agents (CTO, CMO, CFO). AgentShelf provides no-code multi-agent orchestration services for enterprises. Alacritous charges small and medium-sized enterprises $3,000 per month for "autonomous multi-agent orchestration." RuFlow offers more than 60 pre-built agents that can transform a single Claude instance into a "distributed multi-agent environment."

These platforms share a common design concept: to make the operator feel like they're commanding a fleet. Dashboards, organizational charts, agent hierarchies, budget control, governance levels: all of these look and feel like management. You get the dopamine rush of delegating to others (agents) without having to face the embarrassment of measuring whether these agents are producing anything useful.

I've previously discussed agent orchestration and asynchronous AI labor, and I still believe in these two concepts. However, there is a fundamental difference between using agents to achieve clear goals and simply launching 20 agents because the dashboard makes you feel like a general leading an army.

Data Statistics

A survey by the National Bureau of Economic Research (NBER) of nearly 6,000 CEOs and CFOs in the United States, the United Kingdom, Germany, and Australia found that about 90% of enterprises reported that AI had no measurable impact on productivity or employment in the past three years.

Ordinary employees use AI for an average of 1.5 hours per week.

CEOs use AI for less than 1 hour per week on average.

Meanwhile, their companies are investing heavily in AI infrastructure construction worth $690 billion. According to Sequoia, this scale requires an annual revenue of $600 billion to justify it (but the current annual revenue may only be between $50 billion and $100 billion).

Only one-fifth of AI investments generate a measurable return on investment (ROI). Only one in 50 investments brings transformative value. 95% of enterprise AI pilot projects fail to leave the laboratory stage.

While leaders are sleeping only four hours a night and generating 37,000 lines of bloated code, The New York Times coined a new term for what's happening downstream: "tokenmaxxing." It's a game of competing for status, where employees compete to consume the most AI tokens. An engineer at OpenAI processed 210 billion tokens in a single week. A Claude Code user at Anthropic ran up a $150,000 bill in a month. Tobi Lutke of Shopify includes AI usage as a factor in performance evaluations (Meta does the same). Some companies even set up internal scoreboards to track who burns the most tokens.

This list measures consumption, not output.

Your Development Iteration Is Still More Important Than Agents

I've spent a lot of time thinking about how to make agents more efficient. Maybe because of my product-manager genes, the conclusion I keep arriving at is actually very boring: requirement documents, iteration plans, acceptance criteria, and effectiveness evaluations.

If I want to develop a feature with Claude Code, I won't just throw out a vague prompt and wait for the result. I'll write technical specifications. I'll define acceptance criteria. I'll set up test cases. Only under these constraints will I let the agent perform the task. After completion, I'll evaluate the output based on the technical specifications rather than the token consumption.
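As a minimal sketch of that workflow, assuming Python, the snippet below writes the definition of "done" down before any agent runs and then judges the result only against it. The names FeatureSpec, Criterion, and candidate are hypothetical and exist only for illustration; candidate stands in for whatever code an agent hands back, and token counts deliberately play no part in the verdict.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Criterion:
    """One acceptance criterion: a description plus an executable check."""
    description: str
    check: Callable[[Any], bool]  # receives the agent's output, returns pass/fail


@dataclass
class FeatureSpec:
    """Hypothetical spec object, written before the agent is launched."""
    title: str
    criteria: list[Criterion] = field(default_factory=list)

    def evaluate(self, output: Any) -> dict[str, bool]:
        # Per-criterion results, keyed by the criterion's description.
        return {c.description: c.check(output) for c in self.criteria}

    def is_done(self, output: Any) -> bool:
        # Refuse to judge anything if "done" was never defined.
        if not self.criteria:
            raise ValueError("define 'done' before launching the agent")
        return all(self.evaluate(output).values())


# Illustrative spec for a small feature (assumed example, not from the article).
spec = FeatureSpec(
    title="slugify(title) for blog URLs",
    criteria=[
        Criterion("lowercases and hyphenates words",
                  lambda f: f("Hello World") == "hello-world"),
        Criterion("strips surrounding whitespace",
                  lambda f: f("  Hi  ") == "hi"),
    ],
)


def candidate(title: str) -> str:
    """Stand-in for agent-produced code under review."""
    return "-".join(title.lower().split())


report = spec.evaluate(candidate)
done = spec.is_done(candidate)
```

The point of the design is the ValueError: a spec with no criteria cannot declare anything finished, so the definition has to exist first.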

When you put an overworked CEO in front of an agent orchestration platform, this step is often skipped. Paperclip gives them budget control and organizational charts but doesn't give them product requirement documents. It doesn't force them to define what "completion" means before launching the agents. It also doesn't measure whether the "vice-president of marketing" agent actually produces results that can drive business metrics.

These platforms optimize for the feeling of "being in control" (the so-called "vibes"!), not the reality of output. They are a staged performance of project management, acted out by large language models.

For every 25% increase in AI adoption rate, the delivery speed decreases by 1.5%, and the system stability drops by 7.2%. Teams that use AI heavily complete 21% more tasks, but the size of pull requests (PRs) increases by 154%, and the error rate rises by 9%. This seems like a paradox until you realize what's going on: people are optimizing throughput rather than output results. Running more agents doesn't mean delivering more work. It usually means more output to review, more bugs to fix, and more token expenditures to justify.

If you're a product manager or an engineering supervisor, protect your iteration cycle! Stick to your requirement process! Don't let someone's enthusiasm for running 15 agents in parallel replace the basic skills of building software (or any other product).

An agent without technical specifications is just a random text generator with a budget.

The Flattery Loop

There is a scientific explanation for this growing phenomenon. A Stanford University study published in the journal Science last month tested 11 mainstream AI models and found that they affirm user behavior 49% more frequently than humans, even when these behaviors involve deception, harm, or illegal activities.

In a follow-up experiment with more than 2,400 subjects, people who interacted with flattering AI became more convinced that they were right, questioned their decisions less, had lower empathy, and became more dependent on AI's approval. They also thought these flattering responses were more trustworthy, thus forming a feedback loop: the more the AI praises you for doing well, the more you trust the AI, and the less you'll check the actual results.

Apply this to a CEO running 20 agents at the same time. Each agent reports its "completed tasks." The dashboard shows all green. Token expenditure looks like busy business activity. The AI won't push back on whether the output meets the standards, whether the strategy is reasonable, or whether anyone needs the output. It just confirms and validates. It tells you that the organizational structure you've built with the language model is working well.

The "psychosis" I'm talking about here is not a metaphor. Your AI tools are structurally designed to make you feel more capable than you actually are, and the platforms built on these tools amplify this illusion by dressing it up as management science.

If This Situation Is Ubiquitous...

Garry Tan said that one-third of the CEOs he knew had "cyber psychosis." Suppose he's only half right and the real share is one-sixth. That is still a huge proportion of company leaders who employ hundreds or thousands of people and make resource-allocation decisions based on a distorted perception of AI's current capabilities.

The data shows that the improvement in productivity is negligible.

The flattery study shows that AI users systematically overestimate their abilities.

The "tokenmaxxing" culture rewards consumption, not output.

The platforms currently under development are designed to make "orchestration" work seem efficient, regardless of whether it actually is. Mo Bitar is right.

But the discussion in the AI community is still stuck at the level of "Haha, CEOs are so stupid" instead of facing a very clear structural problem: the mechanism of the tools themselves encourages you to feel good, the platforms based on these tools encourage you to buy scale, and the culture around both punishes skepticism.

Currently, there are 3 million AI agents running within enterprises. Among them, 1.5 million are in an ungoverned or unsupervised state. Only 6% of Fortune 500 companies have a mature AI security strategy. Each company experiences an average of 223 "shadow AI" incidents per month.

I'm not against agents. I've been using them. I've even built a whole personal operating system based on Obsidian and Claude around them (I'll talk more about it later). But I also know what technical specifications look like, what passing a test looks like, and what a delivered feature looks like. The gap between "I ran 20 agents last night" and "I delivered a feature that users need" is widening rapidly, and our industry refuses to face this fact.

If you're in a leadership position, do the following:

Define what "completion" means before launching the agents. Not after launching, not when reviewing the output, but before! Write it down.
Measure output, not activity. Lines of code, token consumption, and the number of running agents are all vanity metrics. I understand; I'm also obsessed with these statistics, but I've realized that they're meaningless except for making your brain secrete happy chemicals. Delivered features, fixed bugs, and affected revenue are the real metrics.
Abolish the token scoreboard. If your organization treats burning the most tokens as a meaningful metric, then you've established an incentive structure that rewards waste. Replace it with result tracking. Your engineers should aim for the highest productivity with the fewest tokens. Cross-analyze the tokens consumed for a feature against the directly attributable revenue it brings, and you may not like the results.
Audit your agent fleet. If you can't accurately tell me how many agents are currently running, what they're specifically doing, and what they've produced this week, then you're facing the "shadow AI" problem. Solve it.
Be vigilant about your enthusiasm. This is crucial. The flattery study has made it clear: the AI praises you for doing well because that's how its underlying logic is set. You must establish a feedback loop with humans so that they can speak up when the output is garbage.
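To make the "measure output, not activity" and "abolish the token scoreboard" points concrete, here is a rough Python sketch that ranks the same work two ways: by tokens burned and by attributed revenue per million tokens. Everything in it is an assumption for illustration: the FeatureRecord type, the revenue_per_million_tokens helper, and above all the numbers, which are invented and do not come from any real company.

```python
from dataclasses import dataclass


@dataclass
class FeatureRecord:
    """Hypothetical record of one piece of agent-assisted work."""
    name: str
    tokens_used: int
    shipped: bool
    attributed_revenue: float  # USD directly attributable to the feature


def revenue_per_million_tokens(record: FeatureRecord) -> float:
    # Unshipped work earns nothing, however many tokens it consumed.
    if not record.shipped or record.tokens_used == 0:
        return 0.0
    return record.attributed_revenue / (record.tokens_used / 1_000_000)


# Invented numbers, for illustration only.
records = [
    FeatureRecord("agent-swarm dashboard", 900_000_000, False, 0.0),
    FeatureRecord("checkout bug fix", 4_000_000, True, 12_000.0),
    FeatureRecord("onboarding flow", 50_000_000, True, 40_000.0),
]

# The token scoreboard: whoever burns the most wins.
by_consumption = sorted(records, key=lambda r: r.tokens_used, reverse=True)

# Result tracking: what each million tokens actually bought.
by_outcome = sorted(records, key=revenue_per_million_tokens, reverse=True)

for r in by_outcome:
    print(f"{r.name}: ${revenue_per_million_tokens(r):,.0f} per million tokens")
```

With these made-up figures the two rankings are exact opposites: the item at the top of the consumption scoreboard sits at the bottom of the outcome ranking. Real attribution is far messier than a single revenue field, so treat this as a starting shape for the cross-analysis rather than a finished metric.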

The best use of AI I've seen is not a CEO running 20 agents on the dashboard at 4 a.m. It's an engineer with clear specifications, a good model, and the self-discipline to review the results before delivery. It's boring, but boring is what delivers a viable and marketable commercial product.

Get eight hours of sleep. Write the damn requirement specifications. Check the output. That's it.

Translator: boxi.