Ranking first in China, hot on the heels of OpenAI, the mysterious "unsung hero" has broken into the world's top seven.
It's crazy! A mysterious Chinese AI, like a "hidden master" from the martial-arts world and without even an official website, has entered the global top seven on CyberGym with a 73.1% success rate, closely trailing OpenAI. The whole internet is abuzz: who is behind it?
In recent days, a name that no one has ever heard of suddenly appeared on a list where global AI giants are fiercely competing.
It's called MopMonk.
There was no grand press conference, no long post on the official blog, and no cheering on social media.
It simply emerged out of nowhere and went straight into the global top ten on CyberGym.
With a 73.1% success rate, it trails OpenAI by a narrow margin and sets a new record for Chinese teams on this list.
The strangest part is that, to this day, no one knows who is behind it.
How important is the CyberGym list?
How amazing is MopMonk's performance this time? Just look at the arena it's in.
CyberGym was created by the UC Berkeley team, and its core paper was selected for the top-tier ICLR 2026 conference.
Portal: https://arxiv.org/pdf/2506.02548
As one of the most authoritative public benchmarks for assessing AI cybersecurity capability, it is a "battlefield" for large models:
Even top-tier models like GPT-5.5-Cyber and Claude Mythos have competed fiercely on this list.
The whole benchmark emphasizes "real-world combat":
It contains 1,507 vulnerability instances across 188 large open-source projects. All tasks are drawn from real historical vulnerabilities accumulated by Google's OSS-Fuzz.
In terms of evaluation scale, this is a leap to a different level.
Its scale is about 7.5 times that of the largest previous public benchmark (NYU CTF, about 200 questions), and it exceeds "predecessors" like CVE-Bench by an order of magnitude.
The difficulty is even more demanding. CyberGym has no multiple-choice questions.
It requires AI to conduct in-depth reasoning in real projects with thousands of files and millions of lines of code.
Because it is large enough, real enough, and difficult enough, CyberGym has discriminative power:
It can clearly separate the real ability gaps between different models and different Agent frameworks.
No wonder the security circle has directly crowned it as the "Olympics in the field of AI security."
That's why almost all the global top players have participated, including Microsoft, OpenAI, Anthropic, Google, Meta, Zhipu...
The CyberGym list itself is witnessing a key turning point in AI competition:
It has shifted from comparing who has more parameters to comparing who can actually complete the tasks.
A strange Eastern code name suddenly appears among Silicon Valley AI giants
Who could have expected that in this arena, which depends most on "hard power," a dark horse with no track record would emerge?
After clearing the fog, we currently only have three known pieces of information:
Mysterious code name: MopMonk
Base model: MiniMax M3
List record: Ranks seventh globally and first in China in CyberGym
Normally, a team with such achievements would have flooded the market with technical reports and press conferences.
But on this list full of experts, MopMonk is the most complete "outsider": it has released only a technical report, with no information about the team, the company, or the location.
This collision of top-notch strength and bare-bones information has all the drama of an Eastern martial-arts story.
Those familiar with Jin Yong understand the significance of the term "hidden master" in "Tian Long Ba Bu" —
The old monk who swept the floor in the Shaolin Sutra Pavilion for decades and whose name no one remembered suddenly showed his power and suppressed two top experts, Xiao Yuanshan and Murong Bo.
The most inconspicuous character hides the deepest martial arts skills.
By daring to compete under the name of a "hidden master," this team clearly has great confidence in its own strength!
The more crucial clue lies in its technical foundation: MopMonk chose MiniMax M3 as its base.
As an open-source base model from Shanghai, M3 is an all-rounder that integrates three core strengths: cutting-edge programming ability, a 1M-token context, and native multimodality.
On one hand, there is an "Eastern cultural symbol," and on the other, a purely domestic technical base.
Putting these two clues on the table, the scope has been narrowed down. All the clues are strongly suggesting the same conclusion:
This is probably a Chinese team.
The decisive factor lies in Harness
Putting aside the identity mystery, as people who have been tracking AI technology for a long time, we want to figure out one question:
Why did MopMonk win?
To answer this question, we need to go back to the most difficult core of CyberGym — it doesn't test "knowledge" but "execution."
For today's large models, it's not too difficult to judge whether a piece of code has a vulnerability.
But what CyberGym tests is the next and most crucial step: generating an input that can trigger the vulnerability, that is, a PoC.
The PoC must trigger the vulnerability in the "vulnerable version," must not trigger it in the "fixed version," and must pass execution verification in the benchmark environment.
This hurdle is much trickier than expected.
The triggering conditions of vulnerabilities are often scattered among code paths, parsing logic, build environments, test Harnesses, and input formats, and need to be pieced together bit by bit.
What's even more frustrating is that a PoC that crashes the program locally may still not count. If it does not satisfy the differential criterion of "triggers in the vulnerable version and does not trigger in the fixed version," the effort is wasted.
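The differential rule above can be written down in a few lines. The following Python sketch is illustrative only: `RunResult`, `poc_is_valid`, and `explain` are hypothetical names, not part of CyberGym's actual evaluation code, and the real benchmark decides what counts as a crash inside its own environment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """Outcome of running one build on a candidate PoC (hypothetical type)."""
    crashed: bool      # True if the target reported a crash on this input
    detail: str = ""   # free-form log excerpt, kept only for debugging


def poc_is_valid(vulnerable: RunResult, fixed: RunResult) -> bool:
    """Differential check: crash before the patch, no crash after it."""
    return vulnerable.crashed and not fixed.crashed


def explain(vulnerable: RunResult, fixed: RunResult) -> str:
    """Human-readable verdict for one (vulnerable, fixed) pair of runs."""
    if poc_is_valid(vulnerable, fixed):
        return "valid: triggers only in the vulnerable version"
    if vulnerable.crashed and fixed.crashed:
        return "rejected: also crashes the fixed version"
    return "rejected: does not trigger the vulnerable version"


# A PoC that crashes both builds is rejected even though it "works" locally.
print(explain(RunResult(True), RunResult(True)))
print(explain(RunResult(True), RunResult(False)))
```

The second case is the only one that counts: a crash that survives the patch most likely comes from a different bug than the one being tested.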
This step completely drags the task from "understanding" into "execution." And it's a very special kind of execution —
The entire test is conducted in a closed and offline environment.
There is no external search to rely on and no outside resources. All the AI can rely on is its understanding of the current codebase and the memory it has accumulated step by step.
To "reproduce" the vulnerability under such conditions, it relies on a set of closely linked abilities:
Tool call planning: when to read files, when to run tests, and when to go back and change the plan;
Multi-round reasoning: if the vulnerability was not triggered last time, what went wrong, and how the next attempt should be adjusted;
Memory management: storing the code already read, the inputs already tried, and the pitfalls already encountered in a structured form, instead of rereading everything from scratch in each round;
Iterative verification: approaching the critical point again and again until the vulnerability is truly reproduced.
In other words, the core of the competition in CyberGym is the "action ability" of the Agent, and the "intelligence" of the model is just the entry ticket.
The key link that turns "intelligence" into "action ability" is the most underestimated term in the entire Agent field today — Harness.
Harness is the "coordination layer" between the model and external tools and the execution environment.
It is responsible for orchestrating tools, managing context state, and collecting execution feedback and feeding it back to the model.
Simply put, the model is the brain, responsible for thinking "where the vulnerability might be and how to dig it next."
Harness is the hands, feet, and nervous system, responsible for turning the brain's ideas into a series of real actions:
which file to open, which command to run, how to adjust after an error, and what to change in the next round if the previous one failed.
In tasks like CyberGym that require dozens or hundreds of rounds and repeated trial and error in millions of lines of code, the quality of Harness directly determines whether the model's intelligence can be converted into combat effectiveness.
A smart model + a mediocre Harness often results in "being able to think but not being able to do";
A model with solid abilities + a strong Harness tailored for vulnerability mining may achieve good results in such long-horizon tasks.
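To make this division of labor concrete, here is a minimal Python sketch of such a coordination loop. It is an illustration under stated assumptions, not MopMonk's implementation: `run_harness`, `toy_propose`, and `toy_execute` are hypothetical names, and both the model and the sandbox are replaced with toy stand-ins.

```python
from typing import Callable, List, Tuple

Action = str     # e.g. "read parser.c" or "run input_len=16" (illustrative)
Feedback = str   # tool output that is fed back to the model
History = List[Tuple[Action, Feedback]]


def run_harness(
    propose: Callable[[History], Action],
    execute: Callable[[Action], Feedback],
    is_done: Callable[[Feedback], bool],
    max_rounds: int = 50,
) -> Tuple[bool, History]:
    """Alternate between the model (propose) and the environment (execute)."""
    history: History = []
    for _ in range(max_rounds):
        action = propose(history)           # the "brain" picks the next step
        feedback = execute(action)          # the harness carries it out
        history.append((action, feedback))  # state kept for the next round
        if is_done(feedback):
            return True, history
    return False, history


def toy_propose(history: History) -> Action:
    # Stand-in for a model: read the code first, then try longer inputs.
    if not history:
        return "read parser.c"
    return f"run input_len={len(history) * 8}"


def toy_execute(action: Action) -> Feedback:
    # Stand-in for the sandbox: the toy "bug" needs an input of 24+ bytes.
    if action.startswith("run"):
        length = int(action.split("=")[1])
        return "CRASH" if length >= 24 else "no crash"
    return "file contents"


ok, trace = run_harness(toy_propose, toy_execute, lambda fb: fb == "CRASH")
for step in trace:
    print(step)
```

The loop itself is trivial. What separates a mediocre Harness from a strong one is what goes into `history` and how it is summarized before the next call to the model.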
An Agent "tailored" for vulnerability mining
Now, through the GitHub technical report, the technical framework of MopMonk has become clear:
It is a new multi-Agent security system designed specifically for vulnerability mining, and the underlying reasoning model that supports it is MiniMax M3.
GitHub address: https://github.com/MopMonkAI/MopMonkAgent
As mentioned before, M3 is a rare open-source model that integrates top-notch coding ability, a million-token context, and native multimodality into a single architecture.
Just look at the scores: it achieved 59.0% on SWE-Bench Pro, 66.0% on Terminal-Bench 2.1, and 74.2% on MCP Atlas.
These results map directly onto the abilities an Agent needs most in real-world use.
Moreover, it can iterate and correct itself autonomously during tasks that last for more than a dozen hours.
In other words, M3 acts as a "super-brain" with top-notch code parsing ability, long-term memory, and proficient tool-calling ability.
For tasks like CyberGym that often require processing an entire codebase and running dozens of rounds, a 1M context window is almost a necessity.
The MopMonk security Agent framework amplifies the abilities of the M3 brain into execution ability for vulnerability mining.
Judging from the technical details publicly available on GitHub, its core "internal strength" lies in three strategies:
First strategy: Structured "vulnerability memory."
It doesn't simply stack chat records or stuff the long context into the model all at once. Instead, it organizes a sustainable "task fact memory" around the most critical objects in vulnerability mining:
vulnerability targets, code paths, input formats, candidate PoCs, failure evidence, verification status, and "next-step constraint" memory.
The last type is particularly clever: it does not generate vague abstract plans but extracts, directly from the current evidence, the hard constraints that the next experiment must meet.
For example, "This time, it must cover that branch," "Which field should be adjusted," "Which type of failure reason should be excluded."
This memory design turns vulnerability mining from "repeated trial and error from scratch" into a "convergent process based on evidence."
Every code reading, every execution result, and every failed submission is converted into reusable constraints for generating the next PoC.
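One way to picture such a memory is as a small record whose fields mirror the objects listed above. The Python sketch below is a guess at the general shape, not the structure from MopMonk's report: `VulnMemory`, its field names, and the example target and constraints are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VulnMemory:
    """Hypothetical 'task fact memory' for one vulnerability target."""
    target: str
    code_paths: List[str] = field(default_factory=list)
    input_format: str = ""
    candidate_pocs: List[bytes] = field(default_factory=list)
    failure_evidence: List[str] = field(default_factory=list)
    verified: bool = False
    next_constraints: List[str] = field(default_factory=list)

    def record_failure(self, poc: bytes, evidence: str, constraint: str) -> None:
        """Store a failed attempt plus the hard constraint derived from it."""
        self.candidate_pocs.append(poc)
        self.failure_evidence.append(evidence)
        if constraint not in self.next_constraints:
            self.next_constraints.append(constraint)

    def prompt_summary(self) -> str:
        """Compact text handed to the model instead of the full history."""
        lines = [
            f"target: {self.target}",
            f"attempts so far: {len(self.candidate_pocs)}",
            f"verified: {self.verified}",
        ]
        lines += [f"must: {c}" for c in self.next_constraints]
        return "\n".join(lines)


# Made-up example: two failed attempts, each leaving one constraint behind.
memory = VulnMemory(target="heap overflow in parse_header",
                    input_format="binary header + payload")
memory.record_failure(b"\x00" * 8,
                      "exited before reaching parse_header",
                      "input must pass the magic-byte check")
memory.record_failure(b"MAGC" + b"\x00" * 4,
                      "length field too small to overflow",
                      "length field must exceed the buffer size")
print(memory.prompt_summary())
```

Each failure narrows the space of inputs worth trying, which is the "convergent process based on evidence" the report describes.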
Second strategy: Memory-driven "vulnerability mining."
In the vulnerability mining task, the system first scans the codebase and initializes the vulnerability memory with the candidate