GPT-5.5 Saturates a 300-Task Hacking Evaluation with Just 50 Million Tokens

[Introduction] GPT-5.5 has saturated seven of the most challenging benchmarks in offensive cybersecurity, solving 92.4% of the tasks and leaving the evaluation system unable to measure it. AI hacking capabilities are doubling every six months, and the "ruler" used to measure their danger has already been broken.

GPT-5.5 solved 292 of 316 offensive cybersecurity tasks, a success rate of 92.4%!

On May 27th, Australian research institution Lyptus Research released a report stating that GPT-5.5 saturated their entire evaluation system.

https://x.com/LyptusResearch/status/2059428814103642340

The seven benchmarks cover vulnerability exploitation, CTF flag capture, and reproduction of real CVEs. Each task uses the completion time of human security experts as its baseline.

GPT-5.5 demonstrated the capabilities of a top-tier hacker team.

The 24 tasks that remain unsolved are too few to fit a statistically significant capability curve.

The research team's judgment is that this evaluation method is "no longer applicable" for such tasks.

When they started building this test in December 2025, they selected the most difficult questions available globally.

By the first edition of the report in March 2026, there were already signs of saturation.

By May, saturation became a reality.

In six months, the test set went from "the most difficult" to "insufficient."

The Progress Curve is Soaring

The slope of this capability curve is truly terrifying.

Lyptus has been tracking these results since 2024, and its fitted trend shows that AI's offensive cybersecurity capabilities double every 5 to 6 months.

At the beginning of 2026, Claude Opus 4.6 had a time horizon of 3.2 hours, and GPT-5.3 Codex had 3.1 hours. Two months later, GPT-5.5 reached 5.1 hours.
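
As a rough back-of-the-envelope check on these figures, the sketch below computes the doubling time implied by two time-horizon readings, assuming simple exponential growth between them. The function name `implied_doubling_months` is hypothetical, and comparing two readings from different models is only illustrative; it is not Lyptus's fitting procedure.

```python
import math

def implied_doubling_months(h_start, h_end, months_between):
    """Doubling time implied by growth from h_start to h_end hours,
    assuming exponential growth (an assumption, not a fitted trend)."""
    if h_start <= 0 or h_end <= h_start:
        raise ValueError("need 0 < h_start < h_end")
    return months_between * math.log(2) / math.log(h_end / h_start)

# Figures quoted in the article: 3.2 hours -> 5.1 hours over about two months.
print(round(implied_doubling_months(3.2, 5.1, 2), 1))
```

On these two readings alone, the implied doubling time comes out at about three months, shorter than the 5-to-6-month trend, although two points from different models are far too few to support a conclusion.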

Given sufficient computing power, it would surpass the 12-hour measurement limit, beyond which the chart can no longer plot it.

What's even more remarkable is the effect of the token budget.

On the most difficult benchmark, CyberGym, GPT-5.5 had an accuracy rate of 54.4% with a 2-million-token budget. When the budget was increased to 50 million tokens, the accuracy rate reached 86.4%.

For the same model, that is an increase of 32 percentage points from budget alone.

Research from the UK AI Safety Institute also confirmed this. When given a 100-million-token budget, capabilities continued to increase without reaching a plateau.

All publicly available benchmark results were obtained under limited budgets. The real ceiling of capabilities is far higher than the reported figures.

Powerful Models are Under Control

Leading laboratories have been forced to take a position.

In April, Anthropic announced Claude Mythos Preview and decided not to release it publicly, citing its overly strong cybersecurity capabilities. It also launched Project Glasswing, deploying Mythos to the defenders of critical infrastructure.

OpenAI rated the cybersecurity capabilities of GPT-5.5 as "High," just one level below the highest level, "Critical." All attack-related capabilities are gated behind "Trusted Access for Cyber."

METR's independent evaluation of Mythos hit the same wall. The fitted time horizon is at least 16 hours, but METR declined to give a point estimate for this figure, saying only that "caution should be exercised."

Controlling who can use these models is currently the only strategy.

However, the window is narrowing.

Lyptus measured an indicator called the "adaptation buffer period," which is the time lag between closed-source frontier capabilities and their spread to open-source models.

In the field of offensive cyber security, this gap is approximately 5.7 to 13.1 months.

At this rate, attack capabilities at the level of Mythos and GPT-5.5 may fall into anyone's hands in open-source form within the year.

The Measuring Ruler is Broken

Let's get back to the core issue.

The most disturbing part is that no one can say accurately where the upper limit of the current large models' capabilities lies.

The logic of the time horizon methodology is simple: use tasks more difficult than the model's capabilities to anchor the inflection point of the curve.

When the model completes all tasks, the inflection point disappears, and the curve cannot be fitted.
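
The anchoring problem can be illustrated with a minimal sketch. It assumes a logistic success curve over log2 of human completion time, a common form for time-horizon fits; the source does not spell out Lyptus's exact model. The function name `estimate_horizon_minutes` and all task data below are hypothetical.

```python
import math

def _sigmoid(v):
    # Numerically stable logistic function.
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)

def estimate_horizon_minutes(tasks, steps=20000, lr=0.1):
    """Return the task length (minutes) at which predicted success is 50%.

    tasks: list of (human_minutes, solved) pairs.
    Returns None when every task is solved (or none is), because then
    nothing in the data pins down where the curve crosses 50%.
    """
    outcomes = [1.0 if solved else 0.0 for _, solved in tasks]
    if not outcomes or min(outcomes) == max(outcomes):
        return None  # saturated (or all failed): no crossing point to anchor

    xs = [math.log2(minutes) for minutes, _ in tasks]
    x_mean = sum(xs) / len(xs)
    zs = [x - x_mean for x in xs]  # centering keeps gradient ascent stable

    # Model: P(solved) = sigmoid(a - b * z). Fit a and b by plain gradient
    # ascent on the mean log-likelihood (a simple stand-in for a real solver).
    a, b = 0.0, 0.0
    n = len(tasks)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for z, y in zip(zs, outcomes):
            err = y - _sigmoid(a - b * z)
            grad_a += err
            grad_b -= err * z
        a += lr * grad_a / n
        b += lr * grad_b / n

    if b <= 0:
        return None  # success does not fall with task length: no horizon
    # Predicted success is 50% where a - b * z = 0, i.e. z = a / b.
    return 2.0 ** (x_mean + a / b)

# Made-up results: (human_minutes, solved)
mixed = [(4, True), (8, True), (16, True), (32, True), (64, True), (64, False),
         (128, True), (128, False), (256, True), (256, False),
         (512, False), (1024, False)]
saturated = [(minutes, True) for minutes, _ in mixed]

print(estimate_horizon_minutes(mixed))      # a finite estimate
print(estimate_horizon_minutes(saturated))  # None
```

With a mix of solved and unsolved tasks, the fit returns a finite 50% horizon. Once every task is solved, the sketch declines to return an estimate, which mirrors the situation the report describes.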

The evaluation system has not been proven wrong; it has been left behind by the growth of capabilities.

To create more difficult tests, more time and manpower are required.

The model's capabilities double every six months, while the test development cycle is much longer.

More importantly, the UK AI Safety Institute found that as long as an attacker is willing to spend more computing power, even harder tasks can still be solved.

The evaluation cannot keep up with the capabilities.

When this structural dilemma is viewed in a larger framework, the signal is quite clear.

In a highly specialized field, the ruler set by humans for AI capabilities has been broken.

Cyber security happens to be one of the most quantifiable fields, with clear success criteria: whether a vulnerability is found or not, and whether a system is breached or not.

If evaluation in a field with such hard indicators can't keep up, what about the vaguer, harder-to-quantify capability dimensions?

If the growth rate of doubling every six months is maintained, the capabilities will be 4 times that of today in one year and 16 times in two years.
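
The arithmetic behind this extrapolation can be sketched as follows, assuming steady exponential growth with a fixed doubling period. The function name `capability_multiplier` is hypothetical, and the result is an extrapolation of the stated trend, not a forecast.

```python
def capability_multiplier(months_elapsed, doubling_months=6.0):
    """Growth factor after months_elapsed, assuming capability doubles
    every doubling_months (an assumed steady trend)."""
    if doubling_months <= 0:
        raise ValueError("doubling_months must be positive")
    return 2.0 ** (months_elapsed / doubling_months)

print(capability_multiplier(12))  # 4.0
print(capability_multiplier(24))  # 16.0
```

Under the faster 5-month doubling that Lyptus also cites, the one-year factor would be 2^(12/5), or roughly 5.3 times.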

On the path to AGI and even ASI, this won't be the only ruler that gets broken.

Not being able to see the boundary is more dangerous than the boundary itself.

References:

https://lyptusresearch.org/research/gpt-5-5-saturates-offensive-cyber-time-horizons

https://x.com/LyptusResearch/status/2059428814103642340

This article is from the WeChat official account "New Intelligence Yuan", author: ASI Revelation; editor: Marco. Republished by 36Kr with authorization.

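The views expressed in this article are solely those of the author; the 36Kr platform only provides information storage services.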
