Dialogue with MiniMax (Yuan-1): Agents to Eventually Surpass Humanity

A storm is brewing.

In the world of agents, April was still a time of gathering storms. Before May even ended, it was already a bloodbath.

The acceleration of the entire industry has been unreasonably fast. Vibe Coding is no longer a new term, and the programming track has never been so crowded: Claude Code, Codex, and Cursor are in close combat, while Trae, Qoder, and CodeBuddy are locked in a fierce battle.

Industry jargon has been popping up one after another, capturing everyone's attention. Last year, it was all about "skill," but now "harness" has taken the throne.

Amid the trendy terms, competition among models has become so tight that the field is nearly level. Benchmarks may disagree, but broadly speaking, Opus, GPT, Qwen, GLM, Kimi, and MiniMax can all write code and handle increasingly complex tasks with ease.

There are still differences between models, but what truly sets model companies apart is no longer the model itself, but the shell that wraps around it.

A previous research report dissected the leaked code of Claude Code and found that only 1.6% of the code was actually related to model decision-making, while the remaining 98.4% was all about harness for managing permissions, context, and error handling.

To further leverage the advantages of models, a new generation of agent products has emerged like a tidal wave. Grok Build, Qoder 1.0, and TRAE SOLO have all been launched. Even DeepSeek, which has always been low-key, has posted multiple job openings to assemble an agent development team.

MiniMax, which started deploying agents earlier than most of the industry, has made its move in the melee. In mid-May, its desktop product launched the Agent Team feature, built on a brand-new multi-agent orchestration architecture. With the M3 flagship model, MiniMax's desktop product has been fully upgraded to MiniMax Code, once again stirring up an agent market where big players and emerging startups coexist.

The core of the Agent Team is an "adversarial" Leader-Worker-Verifier architecture. The responsibilities of doing the work and checking for errors are split across different agents, which are managed by a state machine with hard-coded logic and whose contexts are isolated from one another.

This approach addresses the well-known stubborn problems in long-running agent tasks: context pollution, context anxiety, and "collusion" between agents.

Interestingly, as mentioned before, MiniMax didn't wait for the release of M3. Instead, it launched the Agent Team on M2.7 first. MiniMax calls the M2 generation "great skill appears as clumsiness." The symbiosis between the model and the harness has shown the dawn of a new era. As expected, M3 will only be stronger.

At the most critical moment of training M3, APPSO had a conversation with Zeyin, a research and development engineer of MiniMax Agent.

We discussed the design principles of the Agent Team and the MiniMax insights it represents, explored its technical core, and analyzed how other players constrain and empower agentic models.

There is a prevalent view in the industry: Anthropic has the best model but the worst engineering. In Zeyin's opinion, Anthropic fundamentally doesn't trust the model, presuming that the model will cheat and be tricky, so it imposes constraints everywhere. In contrast, the core of OpenAI's harness is a minimalist agentic loop.

A minimalist framework has nurtured a model with excellent compliance, while a highly constrained framework has produced a "black swan." MiniMax's approach to building agents combines the two but is not exactly the same: It believes in the model and gives it the same operational permissions as humans, but also adds reasonable constraints to the harness.

These ideas are unique in the industry, but the industry now absorbs new ideas and turns them into common knowledge faster than new ideas emerge. In the field of agents, MiniMax has no barriers, and neither does anyone else. Zeyin sent me a 71-page paper and told APPSO:

"All the information about agents is in this paper. If one paper can explain it all, what barriers are there?"

But MiniMax still has its unique skills.

They strive to continuously output new insights to the entire industry at the fastest speed, acting as the leader, executor, and verifier of common knowledge. That's why the Agent Team and its underlying architecture were made public before M3.

Ultimately, the "open-source" approach of Chinese model companies won't last forever.

But that doesn't mean that excellent insights shouldn't be shared with the world in a timely manner.

Just as an agent's work has its stopping conditions, the people developing agents will also reach a stopping point. For Zeyin, it might be when agents can achieve true self-evolution and are more efficient and cost-effective than humans in almost any digital or physical world task.

From his perspective on the front line, we're not far from that future.

Below is the conversation between APPSO and Zeyin, a research and development engineer of MiniMax Agent. Here's a teaser: At the end, we asked an open-ended question and got an unexpected answer.

Architecture Reflects Cognition

APPSO: Why was the Agent Team launched on M2.7 instead of waiting for M3?

Zeyin: It's our intention and our own rhythm not to wait for the new model. We just want to convey our latest insights to the outside world continuously. This is a very worthwhile thing to do. And it has been used internally for a long time. After a month, we thought it was ready for public release.

APPSO: Nowadays, all cycles are getting shorter. A month is a long time.

Zeyin: When we released it, our model hadn't been iterated yet, but a group of core users were interested in our agent's operation paradigm, so we released it early to attract them. Building a core user base is very important to us. Later, we'll also consider open-sourcing our Agent Team architecture.

APPSO: How has the feedback on MiniMax Code been so far?

Zeyin: This time, we've sorted out the subscription logic. You can use the agent once you subscribe to the token plan. After more than a month, download and subscription numbers have increased considerably. This is actually quite interesting, because if we only provide the API, the barrier to using the model is high and users don't get the best results from it. MiniMax Code lets users experience the model at full capability, which is in line with our long-standing thinking, and this time it has been verified. I think it's great, and it will only get better with M3.

There's an interesting point on the user side. Since we support all modalities, we've found that many users use the Agent Team to generate long-form videos, and some ancient literature enthusiasts use it to generate a large number of poetry recitation audios. These consumer-side, interest-driven use cases were actually beyond our expectations.

Many users also told us that the feeling of having the Agent Team up and running gives them great emotional value.

APPSO: Does it really feel like having a few employees working for you?

Zeyin: Yes. Generally speaking, the multi-agent products of the past two months have been in a fierce battle. Tencent's Marvis gives a stronger sense of "working." Obviously, everyone is closely following the consensus around the Agent Team and how it is implemented.

APPSO: You said that some people use MiniMax Code to make videos. Will it be possible in the future to make videos without using professional video-generation tools, without knowing about scripts, storyboards, or opening and closing frames, and just use the agent to call the all-modality model?

Zeyin: First of all, I'm talking about video-making from the perspective of individual users and hobbies. I think it's feasible. For professional video production, although an Agent Team can run a sample, if it's really put into industrial production, division of labor is still needed. For example, the director is responsible for the ideas, storyboards, and opening and closing frames. Another group of people is responsible for using tools like Hailuo or Seedance for resource extraction.

But I think as the model's capabilities improve, the cost of resource extraction and subsequent editing will be significantly reduced.

We did some research and found that it's actually cheaper to hire an editor to edit videos than to use AI. There's even a service in the market that bundles resource extraction and editing, but the main cost is for resource extraction, and editing is almost free. In fact, they recruit a group of college students to learn video editing in class. The students pay tuition, and their course assignment is to edit videos for them.

APPSO: If a more powerful model like M3 comes out, can it be cheaper than manual editing?

Zeyin: Our model has the ability, but if you do the math, as I said before, the cost of human labor will also keep decreasing.

APPSO: The Agent Team architecture of MiniMax Code, that is, Leader-Worker-Verifier, sounds very reasonable. You developed it first, and then Claude Code followed suit.

Zeyin: We started working on it in March. At first, I discussed it with my colleagues. Once an agent makes a mistake, that mistake stays in its trajectory from then on. But then I thought: if it goes on to work in the right direction, it doesn't actually need the memory of that mistake at all, right?

Based on this idea, we designed this new architecture: separating the agent that does the work from the one that verifies. There should be a mechanism for sending the work back during verification, and a new "brain" should be used for this.

We built this architecture that month, and at that time, it was mainly for internal use, and everyone enjoyed using it very much.
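
As an editorial illustration of the send-back idea above, here is a minimal Python sketch. It is not MiniMax's implementation: the names `Attempt`, `run_with_fresh_retries`, `toy_worker`, and `toy_verifier`, and the `max_rounds` cap, are all assumptions made for the example. The point it shows is that each retry starts from a fresh context holding only the task and the verifier's latest feedback, so the failed trajectory is never carried forward.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Attempt:
    """One worker run, started from a fresh context."""
    context: List[str]  # everything this attempt was allowed to see
    output: str


def run_with_fresh_retries(
    task: str,
    worker: Callable[[List[str]], str],
    verifier: Callable[[str, str], Optional[str]],
    max_rounds: int = 5,  # assumed safety cap, not from the interview
) -> List[Attempt]:
    """Send work back until the verifier accepts it or the budget runs out."""
    attempts: List[Attempt] = []
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        # Fresh context: the task plus the latest feedback only.
        # The previous attempt's output and reasoning are deliberately left out.
        context = [task] if feedback is None else [task, f"reviewer feedback: {feedback}"]
        output = worker(context)
        attempts.append(Attempt(context=context, output=output))
        feedback = verifier(task, output)  # None means "accepted"
        if feedback is None:
            break
    return attempts


# Toy stand-ins for the two agents (hypothetical, for demonstration only).
def toy_worker(context: List[str]) -> str:
    saw_feedback = any(c.startswith("reviewer feedback") for c in context)
    return "sorted ascending" if saw_feedback else "sorted descending"


def toy_verifier(task: str, output: str) -> Optional[str]:
    return None if "ascending" in output else "order must be ascending"


history = run_with_fresh_retries("sort the list ascending", toy_worker, toy_verifier)
```

In this toy run the first attempt is rejected and the second is accepted, and the second attempt's context never contains the rejected output.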

APPSO: What are the specific benefits of using it internally? Is it that it solves previous pain points, or is it more efficient and less error-prone?

Zeyin: Let me give you the simplest example. Say you assign it a task before going to sleep. Even if the task is extremely complex, as long as you set strict controls and your exit criteria are quantifiable and observable, rather than left for the model to decide on its own, the workers and verifiers can keep running while you sleep. With these "access controls" in place, the task will be done when you wake up.

We can say that since March, this new development rhythm and work style have emerged within our company.
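
To make "quantifiable and observable exit criteria" concrete, here is a small editorial sketch in Python. The criteria, metric names (`tests_failed`, `coverage`), and thresholds are assumptions chosen for illustration; the interview does not specify them. Each criterion is a plain predicate over measured values, so whether the task may stop is decided by code rather than by the model's own judgment.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ExitCriterion:
    """A named, machine-checkable condition over measured metrics."""
    name: str
    check: Callable[[Dict[str, float]], bool]


def unmet(criteria: List[ExitCriterion], metrics: Dict[str, float]) -> List[str]:
    """Return the names of criteria that the current metrics do not satisfy."""
    return [c.name for c in criteria if not c.check(metrics)]


# Hypothetical criteria; a missing metric counts as "not yet satisfied".
criteria = [
    ExitCriterion("all tests pass", lambda m: m.get("tests_failed", 1) == 0),
    ExitCriterion("coverage >= 80%", lambda m: m.get("coverage", 0.0) >= 0.80),
]

midway = {"tests_failed": 3, "coverage": 0.85}
final = {"tests_failed": 0, "coverage": 0.85}

still_open = unmet(criteria, midway)  # work continues while this is non-empty
finished = unmet(criteria, final)     # empty list means the task may stop
```

An orchestrator built this way would keep the workers and verifiers looping while `unmet(...)` returns anything, which is one way to read the "strict controls" described above.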

APPSO: What's the fundamental difference between this and the traditional multi-agent orchestration that relies on prompts?

Zeyin: The fundamental difference is that our Agent Team architecture has a set of complex freedom-limiting mechanisms.

First of all, at the operation level, it's a state machine, which is deterministic code with strict restrictions. It can't go beyond this specification. You can think of it as a stricter workflow.

At the agent infrastructure level, we give a great deal of freedom. All agents can communicate with each other, which is completely different from the traditional agentic workflow with a directional flowchart. Of course, the previous workflow could also have loops, but the core was still one step after another.

Let me give you an example. Suppose you use an agent for development, and a missing package in the environment blocks the work. In the previous workflow, it might get stuck, but in our architecture, once the worker or verifier discovers this, it can notify the other agents through various robust mechanisms so they avoid the same pitfall.

For another example, in a research-related task, the leader needs to do some preliminary research at the beginning. In the past, the leader would stop after assigning the tasks. But in our architecture, if the user has new ideas or supplementary thoughts, they can directly tell the leader, and the leader can start at any time, interrupt the current agent team, and add a new orchestration. The agent workflow can be adjusted at any time, and the rest of the heavy work can be left to the model.

As we know, in the context of reinforcement learning, "context anxiety" occurs. When the context is too long, the model doesn't want to work, because not working means no mistakes. Our logic makes it follow the orchestration more strictly and keep working until it meets the exit criteria.
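
The "state machine as deterministic code" point can be sketched as follows. This is an editorial illustration under stated assumptions, not the actual Agent Team code: the state names, the `ALLOWED` transition table, and the `TeamStateMachine` class are all hypothetical. The agents may propose a next step, but only transitions listed in the table are ever accepted.

```python
from enum import Enum, auto


class State(Enum):
    PLANNING = auto()   # leader decomposes the task
    WORKING = auto()    # worker produces output
    VERIFYING = auto()  # verifier checks the output
    DONE = auto()


# Assumed transition table. Moves back to PLANNING model the leader
# interrupting the team to re-orchestrate after new user input.
ALLOWED = {
    State.PLANNING: {State.WORKING},
    State.WORKING: {State.VERIFYING, State.PLANNING},
    State.VERIFYING: {State.WORKING, State.DONE, State.PLANNING},
    State.DONE: set(),
}


class TeamStateMachine:
    """Rejects any transition that is not in the table."""

    def __init__(self) -> None:
        self.state = State.PLANNING
        self.history = [State.PLANNING]

    def advance(self, target: State) -> None:
        if target not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state.name} -> {target.name}")
        self.state = target
        self.history.append(target)


# One send-back round, then acceptance.
machine = TeamStateMachine()
for step in (State.WORKING, State.VERIFYING, State.WORKING, State.VERIFYING, State.DONE):
    machine.advance(step)

# Declaring the task done without any work or verification is refused.
rejected = False
try:
    TeamStateMachine().advance(State.DONE)
except ValueError:
    rejected = True
```

Because the table is ordinary code rather than a prompt, a model that "doesn't want to work" cannot reach `DONE` without passing through `VERIFYING` first.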

APPSO: How do you make agents with the same model source achieve confrontation and avoid collusion?

Zeyin: The answer is simple: it's still the prompts. In 2026, most models have strong enough compliance capabilities, and prompts are becoming more useful. We also do some "detailed work" on prompts. More importantly, we give the model observable stopping conditions. Let the worker and verifier manage different things. For example, the worker's stopping condition is to finish the work, and the verifier's stopping condition is to find bugs in the finished work.

APPSO: In my experience, sometimes I think the work can be delivered, but the agents are still going back and forth. How do you define the intensity of the confrontation between agents? Being too lenient is definitely not good, and being too strict will lead to an infinite loop.

Zeyin: We don't assume all user production scenarios, so we first present this framework, and users can set their own stopping conditions. As for how to set them, through skills, agents can actively summarize them according to the user's preferences for stopping conditions, and these can be used as judgment criteria for future tasks. These skills will vary from user to user, and it's not up to us to generalize. As users use it for a long time, Mavis will understand users better and better.

We've also added similar data during the training of M3 to make the model more proactive, summarize previous trajectories, and extract skills based on user feedback to make the work more observable. As the model's capabilities improve, we can do more and more.

APPSO: One of the features of MiniMax Code is the context isolation between agents, which is counter-intuitive. How did you come up with this idea?

Zeyin: The agent context is divided into three parts: user requests, production materials in the

