Just now, a stunning reversal on the world's most difficult AI exam: a dark-horse AI broke through 36%, while the top-tier models all failed.
[Introduction] Just yesterday, ARC-AGI-3 completely stumped the world's top large models. Then an unknown company dropped a bombshell: its AI scored 36.08% on the very first day! What enabled this dark horse to break through the iron curtain of the world's toughest AI exam? Is it a genuine breakthrough, or is there more to the story?
An astonishing reversal!
Just yesterday, ARC-AGI-3, the toughest test for AI, made its debut, and overnight, global large models were severely defeated.
The top model, Opus 4.6, scored just 0.2%, a truly dismal result, while humans achieved a perfect score.
This left onlookers stunned: both Jensen Huang and the originator of the AGI concept had claimed we had already reached AGI. So are we actually still that far from it?
Surprisingly, ARC-AGI-3 was cracked within just one day!
Just now, a company named Symbolica announced:
"Using the Agentica framework, we achieved a score of 36.08% on the first day of the ARC-AGI-3 test, comprehensively outperforming the CoT model baseline."
They successfully cleared 113 out of 182 levels. Among the 25 available games, they completed 7.
The toughest exam in the world has been breached!
Symbolica's unexpected success on the first day, reaching 36%
Just when people were lamenting Opus 4.6's pitiful 0.2% score and even beginning to doubt whether "AGI is just an illusion woven by big tech companies," a pleasant turn of events occurred.
Why was Symbolica's Agentica framework able to achieve an astonishing score of 36.08% on the day ARC-AGI-3 was released?
Arcgentica is a dedicated agent system for ARC-AGI-3, built on Symbolica's Agentica framework.
Given ARC-AGI-3's punishing scoring formula, (human steps / AI steps)^2, the leading large models are still wandering in the dark, which makes a score of 36.08% a real game-changer.
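Under that formula, slow exploration is punished quadratically. A minimal sketch of the per-level efficiency term implied here (the cap at 1.0 is my assumption; the official aggregation across levels is not shown in this article):

```python
def level_score(human_steps: int, ai_steps: int) -> float:
    """Efficiency term from the article: (human steps / AI steps)^2.

    Capping at 1.0 is an assumption -- a solver faster than the human
    baseline presumably cannot earn more than full credit.
    """
    if ai_steps <= 0:
        return 0.0
    return min(1.0, (human_steps / ai_steps) ** 2)

# An AI needing 2x the human steps keeps only (1/2)^2 = 25% of the credit;
# at 5x it keeps a mere 4%, which is why brute force scores near zero.
print(level_score(human_steps=50, ai_steps=100))  # 0.25
```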
To understand why Symbolica won, we first need to understand why Opus 4.6 and GPT-5.4 lost.
The biggest difference between ARC-AGI-3 and its previous two generations is that it is not a "static image description" but an interactive black-box game.
When a pure LLM-based intelligent agent enters the game, its most fatal weakness is that it tries to use association instead of logic and pattern matching instead of experimentation.
When facing an unknown environment, large models use their vast pre-trained knowledge base to "fill in the blanks." Seeing red squares and blue lines, they might associate it with "Sokoban" or "water level balance" and then output CoT based on this wrong assumption.
If the assumption is wrong, it won't stop to reflect but will keep going down the wrong path until it runs out of steps and scores zero.
ARC-AGI-3 precisely targets these weaknesses of AI and measures three key abilities of AI in an environment that humans can solve 100%:
- Efficiency of skill acquisition over time
- Long-range planning ability under sparse feedback
- Experience-driven adaptability across multiple steps
Symbolica's Agentica framework takes a completely different technical approach!
Agentica natively supports a multi-agent architecture and is designed to be parallelizable. It automatically breaks down complex tasks into sub-problems and delegates the work to sub-agents to complete in parallel.
This means that the agents can work efficiently and complete tasks faster right out of the box!
Agentica is a type-safe AI framework that enables seamless integration of LLM agents with code, including functions, classes, active objects, and even entire SDKs.
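Agentica's actual API is not shown in this article, but the fan-out it describes, decomposing a task and running sub-agents in parallel, can be sketched with standard-library primitives (all function names below are hypothetical stand-ins):

```python
from concurrent.futures import ThreadPoolExecutor

def run_subagent(subtask: str) -> str:
    # Stand-in for a real LLM-backed sub-agent call.
    return f"result for {subtask}"

def decompose(task: str) -> list[str]:
    # Stand-in for the framework's automatic task decomposition.
    return [f"{task}/part-{i}" for i in range(3)]

def solve_in_parallel(task: str) -> list[str]:
    # Fan the sub-problems out to worker threads and gather results in order.
    subtasks = decompose(task)
    with ThreadPoolExecutor() as pool:
        return list(pool.map(run_subagent, subtasks))

print(solve_in_parallel("explore-level-1"))
```

In a real harness each worker would be an API call to a model, so thread-based parallelism (rather than process-based) is the natural fit: the work is I/O-bound.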
Previously, thanks to its powerful long-range reasoning capabilities, Symbolica achieved state-of-the-art results on ARC-AGI-2, and the Agentica SDK played a crucial role.
Core secret: Arcgentica RLM harness
On the project's GitHub page, in the IDEA.md file, we found the secret weapon behind the Agentica framework's result: its ARC-AGI-3 agent harness.
GitHub link: https://github.com/symbolica-ai/ARC-AGI-3-Agents
Agent harnesses have become a hot topic recently, repeatedly mentioned on Anthropic's official blog and in discussions among industry experts.
If 2025 was the starting point of the golden age of intelligent agents, then 2026 will focus on intelligent agent frameworks (Agent Harnesses).
An agent harness is the infrastructure built around an AI model to manage long-running tasks; it is not the agent itself.
This time, Agentica understood the game mechanism from scratch and solved multiple level puzzles without any specific game hints.
What makes the Arcgentica RLM framework, built on the Agentica SDK, special?
First, it is game-agnostic.
ARC-AGI-3 is difficult because it removes all natural language prompts. Humans can pass the test because we have physical intuition.
Therefore, Agentica adopts the most extreme "game-agnostic" strategy.
The agent doesn't know what colors represent, what actions do, or what the winning conditions are. It infers everything by interacting with the game and observing changes.
This blank state actually works in its favor.
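One way to realize "infer everything by observing changes" is frame differencing: act, then compare the grid before and after. A minimal sketch (the grid format and the interpretation are assumptions, not the harness's actual code):

```python
def diff_frames(before: list[list[int]], after: list[list[int]]) -> set[tuple[int, int]]:
    """Return the (row, col) cells whose value changed between two frames."""
    return {
        (r, c)
        for r, row in enumerate(before)
        for c, value in enumerate(row)
        if after[r][c] != value
    }

before = [[0, 0, 0],
          [0, 1, 0]]
after  = [[0, 0, 0],
          [0, 0, 1]]
# The agent knows nothing about what "1" means -- only that this action
# moved a value one cell to the right.
print(sorted(diff_frames(before, after)))  # [(1, 1), (1, 2)]
```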
Second, it uses a "coordinator + specialized sub-agents" model.
The top coordinator never directly operates the game. It delegates tasks to sub-agents, accumulates knowledge, and decides the next action.
Specialized sub-agents include explorers, theorists, testers, and solvers.
If the coordinator looked at the grid directly, its context would be flooded with pixel data and it would lose its capacity for strategic thinking. Sub-agents therefore report back in short text summaries rather than raw data.
This well-designed decentralized strategy structure avoids the serious flaw in models like Opus 4.6, where "the same brain has to look at pixels, remember rules, and direct actions."
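The coordinator-and-summaries split described above can be sketched as a toy loop; every class, field, and policy here is hypothetical, chosen only to show the shape of the design:

```python
from dataclasses import dataclass, field

@dataclass
class Coordinator:
    """Never touches raw frames; accumulates short text summaries only."""
    knowledge: list[str] = field(default_factory=list)

    def receive(self, summary: str, max_len: int = 120) -> None:
        # Enforce compact reports so pixel dumps cannot crowd out
        # the coordinator's strategic context.
        self.knowledge.append(summary[:max_len])

    def decide(self) -> str:
        # Toy policy: explore until a rule hypothesis arrives, then test it.
        if any(k.startswith("rule:") for k in self.knowledge):
            return "dispatch tester"
        return "dispatch explorer"

coord = Coordinator()
coord.receive("explored level 1: moving right shifts the blue cell")
print(coord.decide())  # dispatch explorer
coord.receive("rule: blue must reach the top-right gap to finish")
print(coord.decide())  # dispatch tester
```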
Third, it has a "shared memory" mechanism.
During the game, all agents share a memories database. Sub-agents record confirmed facts (scene layout, mechanisms, winning conditions) and assumptions (clearly marked) during their work.
New agents query the memory before starting, so they can inherit collective knowledge.
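A memories store that separates confirmed facts from clearly marked assumptions, which a freshly spawned agent reads before acting, might look like this minimal sketch (names and structure are my own, not the harness's):

```python
class SharedMemory:
    """Facts are confirmed observations; hypotheses are explicitly flagged."""

    def __init__(self) -> None:
        self.facts: dict[str, str] = {}
        self.hypotheses: dict[str, str] = {}

    def record(self, key: str, value: str, confirmed: bool = False) -> None:
        # Unconfirmed entries land in the hypotheses bucket, never facts.
        (self.facts if confirmed else self.hypotheses)[key] = value

    def confirm(self, key: str) -> None:
        # Promote a hypothesis to a fact once a test verifies it.
        if key in self.hypotheses:
            self.facts[key] = self.hypotheses.pop(key)

    def brief(self) -> dict[str, dict[str, str]]:
        # What a new agent queries before starting work.
        return {"facts": dict(self.facts), "hypotheses": dict(self.hypotheses)}

mem = SharedMemory()
mem.record("exit", "top-right gap", confirmed=True)
mem.record("blue cell", "probably the player avatar")
mem.confirm("blue cell")
print(mem.brief())
```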
Fourth, it has a "level switching" mechanism.
Level switching: When a level is solved, the next level is directly loaded in the same operation, and the returned screen is already the new level.
Only when all levels are cleared will state=WIN be triggered; the completion of a single level is determined by observing the increase in levels_completed.
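Following the article's wording, a single level's completion is detected by watching levels_completed rise, while state=WIN only fires at the very end. A minimal sketch of that detection logic (the frame's exact schema is an assumption):

```python
def classify_transition(prev_completed: int, frame: dict) -> str:
    """Classify what just happened from the newly returned frame's metadata."""
    if frame.get("state") == "WIN":
        # Only triggered once every level in the game is cleared.
        return "all levels cleared"
    if frame.get("levels_completed", 0) > prev_completed:
        # Per the article, the returned screen already shows the next level.
        return "level solved, next level loaded"
    return "same level"

print(classify_transition(2, {"state": "PLAYING", "levels_completed": 3}))
# level solved, next level loaded
```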
Fifth, Agentica has strict action budget management, and every token is used wisely.
The total number of actions across all levels is capped (roughly 800). The scheduler allocates an action quota to each sub-agent through make_bounded_submit_action(limit), and agents are required to avoid repeating actions unless they are genuinely stuck.
Moreover, it prioritizes targeted attempts rather than brute-force exhaustive exploration.
In addition, there are regulations such as sub-agents needing to allocate tools as needed and the scheduler needing to balance reuse and restart.
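The article names make_bounded_submit_action(limit) but does not show its code; a plausible sketch of such a budget wrapper (the inner submit function is a stand-in):

```python
def make_bounded_submit_action(limit: int, submit_action):
    """Wrap submit_action so a sub-agent cannot exceed its action quota."""
    remaining = limit

    def bounded(action):
        nonlocal remaining
        if remaining <= 0:
            # Force the scheduler to re-allocate budget instead of
            # letting one sub-agent burn the shared pool.
            raise RuntimeError("action budget exhausted")
        remaining -= 1
        return submit_action(action)

    return bounded

log: list[str] = []
submit = make_bounded_submit_action(2, log.append)
submit("UP")
submit("DOWN")
# A third call would raise RuntimeError rather than spend a third action.
```

Handing each sub-agent a pre-bounded closure, instead of trusting it to count, keeps the roughly 800-action global cap enforceable at a single point.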
It should be noted that the official positioning of ARC-AGI-3 emphasizes "abilities such as exploration, perception → planning → action, memory, goal acquisition, and alignment."
Agentica's division of labor and control strategy is almost an "engineering breakdown" of these abilities:
Exploration: Conducted by sub-agent explorers within the action budget, extracting "mechanism clues" through differential observation as much as possible.
Planning/rule inference: Sub-agent theorists derive rules under the constraint of "not allowing submit_action" to reduce the consumption of meaningless actions.
Memory: The explicit memories database makes it easier to reuse cross-level strategies, reducing the action and token costs of "repeated learning."
Long-range adaptation: Level transitions are detected by levels_completed, and the coordinator decides whether to continue using the existing strategy or re-enter the exploration cycle.
Obviously, this mechanism is well suited to ARC-AGI-3's scoring structure (higher weights for later levels and a squared efficiency penalty): it encourages the system to spend actions on the experiments with the highest information gain and to transfer strategies quickly to the higher-weight levels.
Is the high score of 36.08% inflated?
However, although the 36% score is undoubtedly impressive, until it is officially verified by the ARC Prize, Symbolica's "upset" is still shrouded in mystery.
Symbolica also admits that this result has not been officially certified by the ARC-AGI-3 organizing committee.
One crucial phrase appears in the materials: "unverified competition score".
Is Symbolica's current result based on its self-built environment or a strict reproduction of the official evaluation process? This is a question that needs to be answered.
Moreover, there are some unusual details in the published score breakdown.
For example, Symbolica pointed out that "the human baseline score obtained through the ARC-AGI-3 API indicates that game cn04 has a total of 6 levels. This does not match the number of levels of the