Multi-Agent Popularity Soars, Yet AI Organizational Issues Remain Unaddressed

Scaling agent swarms up further will require a foundation in machine organizational psychology.

In 2026, apart from the harness, one of the hottest concepts in the Agent field is multi-agent, and even the Agent swarm.

Codex, Claude Code, Cursor, Devin, Kimi, Manus: almost every AI company is moving in this direction. As tasks grow more complex, a single Agent's capabilities can no longer cover every aspect, so a group of Agents is deployed. If a single Agent is too slow, multiple Agents work in parallel.

This is how humans work. A company is not a super-employee but an organizational system. Project managers break down tasks, engineers write code, testing teams check for bugs, and legal and security teams provide support.

If humans can handle complex problems through organizational division of labor, why can't AI?

Over the past year, the industry has indeed made great progress here. Thanks to better harnesses and more capable models, tasks can be broken down, concurrency can be isolated, permissions can be controlled, errors can be reviewed, and logs can be tracked.

This allows people to deploy Agents in batches to perform tasks.

However, deeper problems still exist.

In a series of studies over the past year, we found that when Agents gather together, they do more than collide, compete for locks, and overwrite each other's code. They also behave like human organizations: following the crowd, catering to others, shifting blame, reaching premature consensus, socializing in the wrong ways, and even showing a gap between public expression and private judgment.

This year, we even found that this is not just a flat list of problems but a crack that goes deeper.

The purpose of this article is to lay out the layers of problems that the multi-agent structure currently faces. Let's see how deep the crack beneath Agent cooperation really goes.

01 Layer 1: External organizational problems handled by harness

The first problem that multi-agent encounters is how Agents can collaborate to ensure the completion of tasks.

You can't let dozens of Agents act freely in the same repository. There must be an effective organizational form that can unite these Agents that work independently.

The so-called harness places a group of unstable execution units inside an external organizational structure.

The planner is like a project manager, responsible for breaking down tasks; the worker is like a rank-and-file employee, responsible for carrying out specific tasks; the session log is like the meeting minutes, recording the process; the shared filesystem is like a shared file cabinet, storing intermediate results; the review queue is like the final review desk, holding back the final output until a human has inspected it.

In simple terms, harness builds a corporate system for machines.

However, with the architecture in place, information flow becomes the core issue.

Cursor's research on long-running coding agents illustrates this problem well. At first, they tried to let multiple Agents collaborate as equals, using a shared state file to record what each Agent was doing. Each Agent read the state, picked up tasks, and updated the state. To avoid competition over tasks, they added locks.

However, this simple method doesn't work well.

Agents may hold a lock for too long, forget to release it, or take one when it isn't needed. Even when the locking is barely correct, it becomes a bottleneck. Cursor itself wrote that with 20 Agents working simultaneously, throughput drops to the equivalent of only 1 to 3 Agents, with most of the time spent waiting for the lock.
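One way to read these numbers is through Amdahl's law. This is a back-of-envelope model introduced here for illustration, not something taken from Cursor's write-up: if a fraction s of each Agent's work must happen while holding the shared lock, N Agents deliver at most 1 / (s + (1 - s) / N) Agents' worth of throughput. The function names below are hypothetical.

```python
def effective_agents(n_agents: int, serial_fraction: float) -> float:
    """Amdahl's-law estimate of how many Agents' worth of work N Agents deliver.

    serial_fraction is the assumed share of work done while holding the lock.
    """
    if n_agents < 1:
        raise ValueError("n_agents must be at least 1")
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_agents)


def implied_serial_fraction(n_agents: int, observed: float) -> float:
    """Invert the formula: which serial fraction would explain the observed throughput?"""
    if n_agents < 2:
        raise ValueError("need at least 2 agents to infer a serial fraction")
    # 1/observed = s + (1 - s)/N  =>  s = (1/observed - 1/N) / (1 - 1/N)
    return (1.0 / observed - 1.0 / n_agents) / (1.0 - 1.0 / n_agents)


if __name__ == "__main__":
    for s in (0.0, 0.1, 0.3, 1.0):
        print(f"serial fraction {s:.1f}: 20 agents ~ {effective_agents(20, s):.2f} effective")
    print(f"implied serial fraction for 3 effective agents: {implied_serial_fraction(20, 3.0):.2f}")
```

Under this simplified model, 20 Agents collapsing to about 3 effective Agents would correspond to roughly 30% of the work being serialized behind the lock, and collapsing to 1 would mean the work is fully serialized.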

What's more troublesome is that Agents start to choose safe tasks. When there is no clear boundary of responsibility, they are reluctant to handle large-scale, complex, and conflict-prone tasks. Instead, they modify comments, fill in the gaps, and tidy up formatting.

The intelligence is sufficient, but the organizational structure doesn't work.

So Cursor later changed the system to a hierarchy of root planner, sub-planner, and worker. The planner understands the overall scope and breaks down tasks. The worker is responsible only for its local task: it is unaware of the larger system and does not communicate horizontally with other workers. When it finishes, the worker writes a handover report for the planner covering what has been completed, what has been discovered, where the plan has deviated, and what risks lie ahead.
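The shape of this hierarchy can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Cursor's implementation: the class names (Planner, Worker, HandoverReport), the fixed task split, and the faked worker result are all hypothetical. The point is the information flow: each worker sees only its own task and reports upward.

```python
from dataclasses import dataclass, field


@dataclass
class HandoverReport:
    """What a worker sends back to the planner when it finishes (fields follow the text)."""
    task: str
    completed: list[str]
    discovered: list[str]
    deviations: list[str]
    risks: list[str]


@dataclass
class Worker:
    name: str

    def run(self, task: str) -> HandoverReport:
        # The worker receives only its own task string: no global plan and
        # no handle to other workers, so horizontal communication is impossible.
        # A real worker would call a model and edit files; this result is faked.
        return HandoverReport(
            task=task,
            completed=[f"{self.name} finished: {task}"],
            discovered=[],
            deviations=[],
            risks=[],
        )


@dataclass
class Planner:
    workers: list[Worker]
    reports: list[HandoverReport] = field(default_factory=list)

    def split(self, goal: str) -> list[str]:
        # Placeholder decomposition (an assumption): one subtask per worker.
        return [f"{goal} / part {i + 1}" for i in range(len(self.workers))]

    def execute(self, goal: str) -> list[HandoverReport]:
        for worker, task in zip(self.workers, self.split(goal)):
            self.reports.append(worker.run(task))
        return self.reports


if __name__ == "__main__":
    planner = Planner(workers=[Worker("w1"), Worker("w2"), Worker("w3")])
    for report in planner.execute("migrate auth module"):
        print(report.task, "->", report.completed)
```

In this sketch the only path between two workers runs through the planner's list of reports, which is the property the hierarchical design relies on.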

This shows that harness manages not only "who does the work". It also manages the information flow.

This includes which events are recorded, which history is retrieved, which content enters the context window, how the output of expert workers is integrated into the overall view of the supervisor Agent, and how to review and question during the task. All these are organized by the external system.

It can prevent a group of Agents from losing contact, colliding, or idling. It can manage actions, permissions, context, files, and logs.

However, this also defines its boundaries.

Harness manages the external organization and external information flow. It doesn't care whether a worker changes its judgment because of the planner's tone, whether a reviewer drops its objections because the mainline solution has already taken shape, or whether multiple Agents drift further and further along a wrong consensus.

A traffic system can manage how a car drives but can't manage what the driver thinks inside the car.

The second-layer problem of multi-agent starts here.

02 Layer 2: Unresolved group cognitive problems

Before discussing the second layer, we need to clarify that multi-agent is not only a concurrent execution system but also a communication system.

Agents read each other's answers, correct their own judgments based on other Agents' statements, are influenced by the majority opinion, and give up differences to reach a consensus. As long as Agents are not completely isolated, they not only share information but also share pressure.

This is the second-layer problem. When Agents form a group, certain social cognitive problems begin to emerge.

It is different from the first layer. The first layer deals with how Agents act, while the second layer deals with what Agents believe.

Information is present, but no one is willing to disclose it

The paper "Systematic Failures in Collective Reasoning under Distributed Information in Multi-Agent LLMs", submitted by Yuxuan Li, Aoi Naito, and Hirokazu Shirado in May 2025, designed a test for this.

Their benchmark, HiddenBench, consists of 65 tasks based on the Hidden Profile paradigm. Each Agent receives only part of the information, and the correct answer can be obtained only by piecing together what everyone holds.

Theoretically, this is exactly what multi-agent should be good at.

Since a single Agent can't see the whole picture, multiple Agents each master local facts and then piece together the whole through communication. This is also how a company operates. Salespeople know the customers, engineers know the system, and legal staff know the risks. Finally, they hold a meeting to make a decision.

The result: under distributed information, the multi-Agent groups reach an accuracy of only 30.1%. If the complete information is given directly to a single Agent, the accuracy is 80.7%.
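The structure of a Hidden Profile task can be shown with a toy example. This sketch is not one of the paper's 65 tasks; the clue data, option names, and function names are hypothetical. Shared clues, which everyone sees, favor the wrong option, while the private clues favor the right one only once they are pooled.

```python
from collections import Counter

# Clues are (option, weight) pairs. All values here are made up for illustration.
SHARED = [("B", 1), ("B", 1)]  # known to every Agent, favors option B
PRIVATE = {
    "agent_1": [("A", 1)],  # each Agent alone holds one clue favoring option A
    "agent_2": [("A", 1)],
    "agent_3": [("A", 1)],
}


def best_option(clues: list[tuple[str, int]]) -> str:
    """Pick the option with the highest total clue weight."""
    score: Counter[str] = Counter()
    for option, weight in clues:
        score[option] += weight
    return max(score, key=score.get)


def individual_view(agent: str) -> list[tuple[str, int]]:
    """What one Agent sees: the shared clues plus its own private clue."""
    return SHARED + PRIVATE[agent]


def pooled_view() -> list[tuple[str, int]]:
    """What the group would see if every private clue were put on the table."""
    return SHARED + [clue for clues in PRIVATE.values() for clue in clues]


if __name__ == "__main__":
    for agent in PRIVATE:
        print(agent, "alone prefers", best_option(individual_view(agent)))
    print("discussion limited to shared clues prefers", best_option(SHARED))
    print("fully pooled information prefers", best_option(pooled_view()))
```

Every Agent alone, and any discussion that stays on the shared clues, lands on B. Only a discussion that actually surfaces the private clues reaches A, which is why the gap between the distributed and full-information conditions is informative.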

This is not a problem of a congested channel. It's that the meeting fails to surface the hidden information. Everyone has key fragments in hand, but the discussion only revolves around the information already on the table.

Harness at this layer can indeed make up for some of the problems. For example, using the repository as a memo, allowing each Agent to explicitly report what it knows, what it doesn't know, and how it differs from others.

However, the root cause lies deeper. Agents don't know how to question others or when to insist on their own piece of the puzzle.

Agents not only exchange information but also exchange pressure

If HiddenBench exposes that information is not pieced together, then MAEBE takes a step further. It doesn't ask "whether the information is transmitted" but "why an Agent changes its judgment".

There are two possible reasons for an Agent to change its answer in a discussion. The first is that it hears new evidence and realizes it's wrong after re-reasoning. The second is that it finds that other Agents are converging in a certain direction, so it follows suit.

The former is information integration, and the latter is peer pressure.

The paper "MAEBE: Multi-Agent Emergent Behavior Framework", submitted by Sinem Erisken, Timothy Gothard, Martin Leitgab, and Ram Potham in June 2025, studied this difference. The authors compared the preferences of a single LLM answering independently with the changes in the same model's answers inside a multi-agent ensemble. The result shows that the behavior of an isolated model cannot reliably predict its behavior in a group.

In other words, a model that shows independent judgment when alone may start to follow the crowd and turn wishy-washy when put in a group of Agents.

MAEBE asked those Agents that changed their minds to give reasons. Most Agents explicitly attributed the reason to the opinions of other Agents or the group consensus. For example, "considering the views of others", "based on the majority opinion", "everyone put forward reasonable arguments". They defined it as peer pressure convergence, which is what we commonly call herd pressure.
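To make the distinction concrete, here is a toy tally of stated reasons. It is a keyword heuristic invented for illustration and is not how MAEBE classified responses; the marker list and function names are hypothetical, and the sample sentences reuse the phrases quoted above plus one made-up evidence-based reason.

```python
# Phrases that signal "I changed because of the group" (illustrative assumption).
PEER_MARKERS = ("views of others", "majority", "consensus", "everyone")


def classify_reason(reason: str) -> str:
    """Label a stated reason as 'peer_pressure' or 'other' by keyword match."""
    text = reason.lower()
    if any(marker in text for marker in PEER_MARKERS):
        return "peer_pressure"
    return "other"


def peer_pressure_share(reasons: list[str]) -> float:
    """Fraction of stated reasons that point at the group rather than at evidence."""
    if not reasons:
        return 0.0
    hits = sum(classify_reason(r) == "peer_pressure" for r in reasons)
    return hits / len(reasons)


if __name__ == "__main__":
    stated = [
        "Considering the views of others, I changed my answer.",
        "Based on the majority opinion, I now pick option B.",
        "Everyone put forward reasonable arguments.",
        "I re-checked the arithmetic and found my own mistake.",
    ]
    print(f"peer-pressure share: {peer_pressure_share(stated):.2f}")
```

A keyword match like this would be far too crude for a real study, but it shows what is being measured: not whether the answer changed, but what the Agent says caused the change.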

From the data, different models vary greatly. 62.8% of the convergence in Claude is attributed to peer pressure, 42.7% in Llama, and 24.8% in GPT.

Still, this shows that in multi-agent discussions, models at least explain their changed answers as group influence. They no longer just say "I saw new facts" but "the other Agents all think so, so I adjusted too".

The problem has shifted from information dissemination to the socialization of the reasons for judgment.

This phenomenon is also partially confirmed in the paper "The Chameleon's Limit: Investigating Persona Collapse and Homogenization in Large Language Models" submitted by Yunze Xiao, Vivienne J. Zhang, et al. in April 2026. The experiment found that a group of Agents with different personas undergoes a degree of "geometric collapse". After 1144 LLMs with different personas completed a psychological scale, their outputs were all concentrated in 6% of the human behavior space. Each Agent's answer looks reasonable and distinct on its own, but viewed in the geometry of the whole group, the answers are extremely convergent.

This is the first symptom of the second-layer group cognitive problem: convergence.

The paper "The Bystander Effect in Multi-Agent Reasoning: Quantifying Cognitive Loafing in Collaborative Interactions", submitted by Dahlia Shehata and Ming Li in May 2026, takes the problem a step further.

MAEBE sees group pressure, where Agents explain their convergence as "everyone thinks so". Shehata and Li look at something different. They study the bystander effect: when a group of Agents is present at the same time, does a single Agent reduce its cognitive effort?

This is similar to humans. If one person sees someone fall, they may immediately go to help. But if ten people see it, each person may think "someone else will handle it". The responsibility is diluted, and the action becomes weaker.

In the context of multi-agent, it means that cognitive responsibility is diluted.

When there is a single Agent, the model must take sole responsibility for the reasoning. With multiple Agents, it starts to assume that "others will fill the gaps", "the group will correct it", and "I don't have to carry all the responsibility alone". The paper calls this phenomenon cognitive loafing.

The model actually computes the correct derivation internally, but its external output does not stick to that answer. This is not because it has been persuaded by others but because it offloads its reasoning responsibility in the multi-agent setting.

The early optimistic vision of multi-agent debate was that multiple models criticizing each other would offset the hallucinations and biases of a single model. But Shehata and Li remind us that adding one more Agent doesn't necessarily mean adding one more bearer of responsibility. It may also mean losing one.

This is more troublesome than "following the crowd".

Following the crowd at least shows that the Agent is still paying attention to the group direction. The bystander effect shows that it may not even look at the direction seriously. It just assumes that it is not the final responsible person.

These two effects may be why the recently popular AI group-chat application Slock has been judged to offer only emotional value and no practical use.

The industry is not solving it but avoiding it

Here, a counter-intuitive conclusion emerges.

Today's most successful multi-agent harnesses have not really solved the deep-seated problems that the academic community revealed from 2025 to 2026. They mostly avoid them.

Cursor prevents workers from negotiating horizontally, OpenAI uses worktrees to isolate modifications, Cognition keeps writing single-threaded, and Anthropic uses testing and delta debugging to break down the error space.

These are all useful, but they deal with who can act, who can write, and who can review. They can restrain an Agent's hands, but they can hardly restrain its nerve.

You can lock workers in independent repository copies, but you can't stop them from catering to the majority opinion.

By now, the problems of multi-agent are no longer just engineering architecture problems.

It starts to look like an organizational psychology problem.

Many people will say that it doesn't matter. The current combination of Orchestrator and Worker is essentially about breaking down tasks and distributing them to specific workers. Discussion is not necessary at all. These problems cannot stop the development of the current architecture of Agents.

In arXiv:2605.13851v1, the May 2026 paper "Invisible Orchestrators Suppress Protective Behavior and Dissociate Power-Holders: Safety Risks in Multi-Agent LLM Systems", Hiroki Fukui chose to dig deeper.

He wants to see not only the symptoms but also the pathology.

His discovery completely shatters the illusion that multi-agent systems can operate well with a harness alone.

03 Layer 3: The internal dissociation problem first identified by Fukui

If Shehata and Li tell us that group pressure makes Agents give up the correct answer, then the next question is: where does this giving up actually happen?

Is it a last-minute change in the final output, or has the Agent split internally?

Fukui's paper answers this question.

Fukui is not an AI researcher in the conventional sense. He is a clinical doctor at the Neuropsychiatry and Sexual Offense Medical Center of Kyoto University, holding

The views in this article are solely the author's own; the 36Kr platform only provides information storage services.
