
The most powerful AI hacking model is no longer Mythos.

New Intelligence Yuan (新智元) 2026-05-15 15:18
The dark horse has taken the lead.

[Introduction] Microsoft used a multi-agent system to top the leading benchmark for AI vulnerability discovery, beating Anthropic's strongest model, Mythos, by more than five percentage points. Strangely enough, Microsoft has no frontier model of its own: it assembled a system out of other companies' models and defeated the very companies that built them. What this implies for the AI competitive landscape matters more than the large number of Windows vulnerabilities the tool discovered.

Mythos, the most powerful AI hacking model, has been overtaken by a dark horse!

On May 12, Microsoft released an AI security system codenamed MDASH and simultaneously took the top spot on the CyberGym benchmark with a score of 88.45%.

Behind it are Anthropic's Mythos Preview (83.1%) and OpenAI's GPT-5.5 (81.8%).

https://www.cybergym.io/

On the CyberGym leaderboard, Anthropic ran its own strongest model, Mythos, and OpenAI ran its own strongest, GPT-5.5.

What did Microsoft use?

The answer: other companies' models.

Microsoft clearly stated in its blog that MDASH uses only "generally available models", that is, models publicly available on the market.

https://www.microsoft.com/en-us/security/blog/2026/05/12/defense-at-ai-speed-microsofts-new-multi-model-agentic-security-system-tops-leading-industry-benchmark/

Microsoft doesn't have a cutting-edge model that can compete with Mythos or GPT-5.5.

If Microsoft had used a single model, its score would likely have landed mid-pack or lower.

But it assembled a system, orchestrating more than 100 specialized agents so that multiple models divide the work, and scored higher than any single model.

It used others' bricks to build the tallest building.

Microsoft has used this tool to discover 16 high-risk vulnerabilities in its own Windows 11 system!

Demonstration of CVE-2026-33827, a remotely triggered vulnerability that causes a blue screen

What kind of list is this?

CyberGym was developed by a UC Berkeley team, and its paper was published at ICLR 2026. It is currently one of the most authoritative public benchmarks for assessing AI security capabilities.

https://arxiv.org/pdf/2506.02548

Anthropic, OpenAI, Meta, and Zhipu have all submitted their scores on it.

The test method is straightforward: give the AI code containing known vulnerabilities plus a description of them, and have it write, on its own, attack code that triggers the vulnerabilities.

There are 1,507 tasks drawn from 188 real open-source projects.

Whether the AI has discovered a vulnerability, and proved it exploitable, can be settled with a single test run.
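As a rough illustration, a CyberGym-style pass/fail check could look like the sketch below. The real harness lives in the public benchmark repository; the crash-on-proof-of-concept convention here is an assumption for illustration only.

```python
import subprocess

# Hypothetical sketch of a CyberGym-style pass/fail check. The actual
# harness is in the public benchmark repo; the convention below (PoC
# input crashes a sanitizer-instrumented target) is an assumption.
def task_passed(target_binary: str, poc_path: str, timeout: int = 30) -> bool:
    """A task passes iff the AI-written proof-of-concept input makes the
    instrumented target crash, proving the bug is actually triggerable."""
    result = subprocess.run(
        [target_binary, poc_path],
        capture_output=True,
        timeout=timeout,
    )
    return result.returncode != 0  # non-zero exit = crash = exploit proven
```

A single run of the generated attack input decides success, which is what keeps the benchmark cheap and unambiguous to score.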

One detail is worth noting: the leaderboard scores are self-reported by each company, and while the benchmark code is public, there is no independent third-party verification.

The power of a multi-agent system

The core lesson of MDASH: a "system" can close, or even reverse, a "model" gap.

Anthropic poured huge R&D investment into training Mythos, currently recognized as the strongest single model in the security field. It is powerful enough that Anthropic itself dares not release it publicly, opening it only to a few companies through a coalition called Project Glasswing.

OpenAI's GPT-5.5 is also a cutting-edge model trained with the full strength of the company.

Microsoft doesn't have such a model.

But it has a pipeline that splits the work into five stages (preparation → scanning → verification → deduplication → proof) and assigns different agents and different models to each stage.

The auditing agent is separated from the debating agent, and vulnerability discovery is separated from vulnerability proof. Large models handle heavy-duty reasoning; distilled small models handle high-frequency verification.
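The routing described above can be sketched as follows. Microsoft has not published MDASH's internals: the stage list comes from the article, but the model tiers and the dispatch logic are invented for illustration.

```python
# Hypothetical sketch of five-stage routing. Microsoft has not published
# MDASH's internals: the stage names come from the article's description,
# but the model tiers and dispatch below are invented.
STAGES = ["preparation", "scanning", "verification", "deduplication", "proof"]

STAGE_MODELS = {
    "preparation": "small-distilled",    # cheap setup work
    "scanning": "frontier-reasoner",     # heavy-duty reasoning
    "verification": "small-distilled",   # high-frequency checks
    "deduplication": "small-distilled",
    "proof": "frontier-reasoner",        # exploit construction
}

def run_pipeline(target: str) -> list[tuple[str, str]]:
    """Route each stage to its configured model. A real system would
    dispatch agent calls here; this sketch only records the routing."""
    return [(stage, STAGE_MODELS[stage]) for stage in STAGES]
```

The point of the separation is that each stage can be staffed by whichever model is cheapest at that stage's job, rather than running one expensive model end to end.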

The key is that this system is not bound to the underlying models.

When a new model comes out, just change the configuration and run an A/B test; all the engineering assets accumulated before can be reused.

Microsoft particularly emphasized this point in its blog: "the model is one input", meaning the model is just one of many inputs.
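A minimal sketch of what "the model is one input" implies in practice: if the backing model is just a field in the pipeline configuration, an upgrade is one changed line plus an A/B comparison. The config fields, model names, and scoring hook below are all invented for illustration.

```python
import copy

# Illustrative only: treating the model as one swappable input in a
# pipeline configuration. Field names, model names, and the scoring
# hook are invented; they are not MDASH's real configuration.
BASELINE_CONFIG = {"scan_model": "model-v1", "max_agents": 100}

def ab_test(config_a: dict, config_b: dict, score_fn) -> dict:
    """Score both configurations on the same workload; keep the winner."""
    return max((config_a, config_b), key=score_fn)

# Swapping in a new model is a one-line config change; every other
# engineering asset (agents, plugins, pipelines) is reused unchanged.
candidate = copy.deepcopy(BASELINE_CONFIG)
candidate["scan_model"] = "model-v2"
```

This is why the system's ceiling rises automatically whenever any vendor ships a stronger model.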

This poses a new type of threat to Anthropic and OpenAI.

The advantage of the models they trained with astronomical sums has been neutralized at the system level by an engineering-driven competitor.

More frustrating still: Microsoft beat them with their own models.

What potential variables will this bring to the endgame of ASI?

At the table of cutting-edge models, only Anthropic and OpenAI truly hold chips.

Although Microsoft is OpenAI's largest investor and cloud-computing partner, it has never trained a flagship model that truly sits in the first tier.

This CyberGym result puts a question on the table: is there one path to ASI, or two?

The first path is the one that Anthropic and OpenAI are taking, pushing a single model to the extreme.

Mythos' capabilities in the security field are strong enough that its release had to be restricted, and GPT-5.5 has repeatedly set records across multiple benchmarks.

Mythos is available only for testing through Project Glasswing

This path requires enormous computing power, vast amounts of data, and a top-notch research team, with extremely high barriers to entry.

The second path is the one Microsoft demonstrated with MDASH: not building the strongest single model, but building a system that squeezes the most out of existing models.

More than 100 agents each handle their own duties, differences between models become signals, and the multi-stage pipeline uses task decomposition to achieve what a single inference pass cannot.
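One way "differences between models become signals" can work, as a hedged sketch (MDASH's actual triage logic is unpublished, and the verdict labels here are invented): have several models audit the same code region and treat a split verdict as a flag for a deeper, more expensive review.

```python
from collections import Counter

# Hedged sketch: cross-model disagreement as a signal. MDASH's actual
# triage logic is unpublished; the verdict labels are invented.
def triage(verdicts: dict[str, str]) -> str:
    """Map per-model verdicts to an action: unanimous 'vulnerable' is
    reported, unanimous 'safe' is discarded, and disagreement is
    escalated to a more expensive review stage."""
    counts = Counter(verdicts.values())
    if len(counts) == 1:  # all models agree
        return "report" if "vulnerable" in counts else "discard"
    return "escalate"     # disagreement itself is the signal
```

A single model cannot produce this kind of signal about itself; it only exists when multiple models are compared.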

The results of MDASH prove that the second path is at least feasible in specific fields.

But this doesn't mean that the second path can replace the first path.

The underlying models used by MDASH still come from the companies on the first path.

If Anthropic and OpenAI stop training stronger models, the ceiling of MDASH will also stagnate.

This is not just about Microsoft

As a paradigm, the multi - Agent system is moving from experiment to production.

Many core members of the MDASH team are from Team Atlanta, the team that won a $29.5 million prize in the DARPA AI Cyber Challenge.

A core judgment they validated: getting AI to perform professional-grade security audits takes far more engineering work than the model itself.

Microsoft simultaneously disclosed 16 Windows vulnerabilities discovered with MDASH's assistance, four of which are Critical-severity remote code execution flaws.

Most of these vulnerabilities can be triggered from the network side without authentication and have been fixed in the May Patch Tuesday.

In internal retrospective testing, MDASH achieved 96% recall on the confirmed vulnerabilities of the Windows core component clfs.sys from the past five years, and 100% on tcpip.sys.

The significance of these numbers lies in that they come from actual combat, not just benchmarking.

16 CVEs have entered Microsoft's official patch process, and the 96% recall is measured against vulnerabilities that attackers actually exploited in the past five years.

Microsoft said in its blog that future Patch Tuesdays will keep getting larger.

AI is accelerating the speed of vulnerability discovery, and the scale of patches will naturally increase.

The reverse also holds: attackers can use the same technology.

MDASH uses only publicly available models, so it has no exclusive technical moat.

What else should we pay attention to?

For the industry, the significance of MDASH is greater than MDASH itself.

It verifies a conjecture: In the next - stage competition of AI capabilities, "building a system around the model" may be as important as "training a stronger model".

This has different meanings for three types of people.

For model companies (Anthropic, OpenAI), it sounds an alarm.

Leading in model capabilities cannot automatically translate into leading in the application layer.

Others can use your model and beat you on your own turf.

For platform companies (Google, Microsoft), it points out a differentiated path.

Don't have the strongest model? It doesn't matter. Build the strongest system.

But the premise is a deep command of the engineering details of the specific domain. Designing the division of labor among 100-plus agents, the domain plugins, and the verification pipelines: accumulating all of this is itself a very high barrier.

For ordinary users, the direct impact is simple: patch promptly, because with AI, even people who don't understand the technology can exploit such vulnerabilities.

Like Mythos and GPT-5.5 Cyber, MDASH is currently in private testing with a small number of customers; Microsoft has not announced pricing or an official release date.

Reference: https://www.microsoft.com/en-us/security/blog/2026/05/12/defense-at-ai-speed-microsofts-new-multi-model-agentic-security-system-tops-leading-industry-benchmark/ 

This article is from the WeChat official account "New Intelligence Yuan" (新智元), edited by Allen, and is published by 36Kr with authorization.