Going head-to-head with OpenAI, a Chinese team ranks among the top two globally in Agentic AI, shooting to fame in a single battle.
[Introduction] Feeling AI made a strong breakthrough on the authoritative Terminal-Bench 2.0 leaderboard with CodeBrain-1, ranking second globally, behind only OpenAI's latest flagship model. This not only breaks the absolute monopoly of the American giants but also marks that China's AI engineering capabilities in Agentic AI (intelligent agents), complex task planning, and autonomous coding have reached the world's top level.
On the eve of the Chinese Lunar New Year, the air in the global tech circle is filled not only with the festive spirit of ringing out the old and ringing in the new but also with an unprecedented sense of rivalry.
Anthropic launched Claude Opus 4.6, and OpenAI responded strongly with GPT-5.3-Codex.
The confrontation between the two sides at the peak of technology seems to be the old story of the "struggle for the throne," but beneath the calm surface, the underlying logic of the competition has quietly changed.
The global large model competition has officially evolved from the "parameter game" in the laboratory to the cruel "real-world evolution."
This time, the giants are no longer indulging in the illusory prosperity of benchmark scores but are focusing firmly on architectural rigor and the long-term sustainability of autonomous workflows: whether a model can "break the deadlock" in the real business world has become the only criterion.
In the head-to-head confrontation over hardcore metrics, both OpenAI and Anthropic chose Terminal-Bench 2.0 to vouch for their strength: Opus 4.6 demonstrated excellent agentic coding ability with a 65.4% success rate on the Agentic Terminal Coding Task, while Sam Altman, with the 5.3-Codex + Simple Codex combination, posted a high score of 77.3% (75.1%) and claimed the top of the global coding performance list.
As NVIDIA's Jim Fan has said: the real terminal environment is the "devil's training ground" for AI.
Self-evolution in a closed-loop environment has become the ultimate measure of a model's engineering ability.
Excitingly, on this authoritative track, the Chinese AI startup Feeling AI has emerged as a dark horse. Built on the GPT-5.3-Codex base model, its self-developed CodeBrain-1 jumped to second place on the global leaderboard with a remarkable 72.9% (70.3%), becoming the only Chinese newcomer in the top ten.
After achieving SOTA in Agentic Memory, Feeling AI scores big again
Five days ago, the Feeling AI team released MemBrain 1.0 late at night, setting new SOTAs on multiple mainstream memory benchmarks such as LoCoMo, LongMemEval, and PersonaMem-v2 and surpassing memory systems and full-context models like MemOS, Zep, and EverMemOS.
On the two hardest evaluations, KnowMeBench Level III, its results improved on previously published results by more than 300%.
Feeling AI played the first card in the new trend of AI technology and capital investment - the field of Agentic Memory.
Its powerful memory ability and a hierarchical memory system adapted to the model's native characteristics signal that Agentic AI is gradually shifting its paradigm from raw model capability toward the level of user experience.
Following the success of MemBrain 1.0, Feeling AI played the second card last night - CodeBrain.
As an "evolutionary brain" with dynamic planning and strategy adjustment capabilities, CodeBrain-1 quickly ranked second globally on the authoritative Terminal-Bench2.0 list, only after OpenAI 5.3-Codex's official Simple Codex.
Feeling AI's official channels have consistently emphasized that dynamic interaction is the final piece of the puzzle on the world model's road to AGI.
Its original cross-modal hierarchical architecture defines three core capabilities - InteractBrain, responsible for understanding, memory, and planning; InteractSkill, responsible for capability execution; and InteractRender, responsible for rendering and presentation. Together they form its technological moat.
Currently, both MemBrain and CodeBrain, which have shown their strength, belong to the core layer of InteractBrain, precisely targeting in-depth understanding and long-term planning in complex dynamic interaction scenarios.
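To make this three-layer split concrete, here is a minimal structural sketch in Python. Every class and method name below is an illustrative assumption based on the description above, not Feeling AI's published API.

```python
# Illustrative sketch of the InteractBrain / InteractSkill / InteractRender
# split described in the article. All names and interfaces are assumptions.

class InteractBrain:
    """Understanding, memory, and planning (where MemBrain and CodeBrain sit)."""
    def plan(self, intent: str) -> list[str]:
        # Toy planner: split a compound intent into ordered steps.
        return [s.strip() for s in intent.split(",") if s.strip()]

class InteractSkill:
    """Capability execution: carries out each planned step."""
    def execute(self, step: str) -> str:
        return f"executed: {step}"

class InteractRender:
    """Rendering and presentation of results back to the user."""
    def render(self, results: list[str]) -> str:
        return "\n".join(results)

def interact(intent: str) -> str:
    # Brain plans, Skill executes, Render presents: one pass through the stack.
    brain, skill, render = InteractBrain(), InteractSkill(), InteractRender()
    return render.render([skill.execute(s) for s in brain.plan(intent)])

# Usage: print(interact("collect wood, craft pickaxe"))
```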
It seems that these two achievements with convincing results globally are not accidental but part of an early layout.
This further explains why, whether it is MemBrain 1.0 for Agentic Memory or CodeBrain-1 for raising the success rate of model task planning and execution, the core focus of their algorithms is on serving capabilities in complex "dynamic interaction" scenarios.
OpenAI clearly defines Simple Codex on its official technology blog as "the optimal solution for long-term software engineering tasks."
A good combination of the model and the Agent framework may become the standard form for the commercial implementation of large models in the future.
In the future, the memory ability of Agentic Memory may become part of the Agent framework, acting like an external memory brain that makes the model stronger through systematic capabilities.
A Chinese framework that can harness the world's top models is precisely the core intelligent hub of the AI era.
The ability to deeply drive top models means that Chinese teams have occupied a high point in the "tactical control center" of the AI era and are participating in defining the engineering standards for future large models.
CodeBrain-1, the "brain" that can dynamically adjust plans and strategies
The latest ranking on the official Terminal-Bench evaluation website shows that CodeBrain-1 ranks second, only after OpenAI's Simple Codex (GPT-5.3-Codex), and Factory's Droid, which uses Anthropic's latest base model Claude Opus 4.6, ranks third.
There are also some familiar Agents or institutions on the list, such as Warp, Coder, Google, Princeton, etc.
(Screenshot from the official website)
Terminal Bench covers a wide range of task types, including complex system operations and a large number of coding tasks that need to be completed in a real terminal environment.
The core focus of CodeBrain-1 is "whether the code can be correctly written and run."
In terms of technical implementation, CodeBrain-1 focuses on refining two key aspects that directly affect "whether the task can be successfully and efficiently completed."
- Useful Context Searching: use only "truly useful" context. In complex tasks, more information is not always better; relevance matters, and reducing noise effectively mitigates LLM hallucination. Based on the current task requirements and an existing codebase index, CodeBrain-1 makes full use of LSP (Language Server Protocol) features to raise the retrieval efficiency of relevant information and assist the code-generation process. For example, when planning tasks for a game bot, it must first understand how to use the bot's API; during coding, CodeBrain-1 accurately obtains the signatures, documentation, and usage examples of relevant methods such as move_to(target) and do(action) through LSP search, reducing both retrieval misses and context interference.
- Validation Feedback: turn failures into valuable information. CodeBrain-1 can efficiently locate errors from LSP Diagnostics and supplement the relevant code and documentation, shortening the Generate -> Validate cycle. For example, when code written by CodeBrain-1 passes the wrong type for the exec parameter when calling on(observation, exec) (a method for defining a Bot Reaction), the LSP reports not only the "argument type mismatch" error but also auxiliary information such as caller examples of the method, documentation for the offending parameter, and how exec is used in the implementation. A minimal sketch of this loop follows the list.
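Putting the two mechanisms together, here is a minimal sketch of such a Generate -> Validate loop in Python. The lsp and llm clients and all of their method names are hypothetical stand-ins for illustration; the source does not document CodeBrain-1's actual interfaces.

```python
# Sketch of a Generate -> Validate loop with LSP-backed context retrieval
# and diagnostic feedback. `lsp` and `llm` are hypothetical clients; none
# of these method names are CodeBrain-1's real API.

def gather_context(lsp, symbols: list[str]) -> list[str]:
    # Pull only the signatures/docs the task needs (e.g. move_to, do),
    # instead of dumping the whole codebase into the prompt.
    return [f"{lsp.signature(s)}\n{lsp.documentation(s)}" for s in symbols]

def generate_and_validate(llm, lsp, task: str, max_rounds: int = 3) -> str:
    context = gather_context(lsp, lsp.relevant_symbols(task))
    code = llm.generate(task, context)
    for _ in range(max_rounds):
        diagnostics = lsp.diagnostics(code)  # e.g. "argument type mismatch"
        if not diagnostics:
            return code                      # validation passed
        # Turn the failure into information: add caller examples and docs
        # for the offending symbol, then request a targeted repair.
        for diag in diagnostics:
            context.extend(lsp.related_examples(diag))
        code = llm.repair(code, diagnostics, context)
    return code
```

The point of the sketch is the shape of the loop: narrow retrieval before generation, and diagnostics fed back as context rather than discarded.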
The team also selected a more focused subset of Terminal-Bench: 47 tasks, all completable in a single programming language (Python). On this subset, CodeBrain-1 likewise demonstrated stable, consistent task completion: more efficient retrieval of relevant code and documentation, and faster problem localization when code inspection or validation failed.
In addition, CodeBrain-1 also showed excellent performance in terms of Token consumption, continuously reducing user costs.
Compared against the technical documentation released by Anthropic, when both use Claude Opus 4.6 as the base model, CodeBrain-1's total token consumption on the Python-task subset where both it and Claude Code succeed is more than 15% lower.
CodeBrain-1's strong performance on Terminal-Bench 2.0 is not only reflected in its end-to-end task execution ability in the real command-line interface (CLI) environment.
More importantly, the team further endowed it with a higher-level ability - a "brain" that can dynamically adjust plans and strategies. By optimizing the task execution logic and error feedback mechanism, it significantly improved the model's operation success rate in the real terminal environment.
CodeBrain-1 proposes a different solution. Instead of letting the AI "do as it pleases," it adjusts the division of labor: the surrounding system sets the rules and constraints, while CodeBrain-1 dynamically generates the executable programs that carry the "intelligence" within those constraints and continuously adjusts them based on actual feedback.
The "plans and strategies" here can apply to both the individual level and the group level.
For an individual, it means that a role can continuously adjust its schedule, behavior choices, and attitude towards others based on its own goals, memory, and observation results. For a group, it means that an organization can form shared memory and adjust its overall planning and response rules based on changes in external conditions.
To more intuitively demonstrate the capabilities of CodeBrain-1, the team put it into a game scenario as a behavior and strategy generation engine.
Case 1: Real-time driving of game bots
In some open-world games, it can act as a game partner: players express their intentions in natural language and let the bot execute them. From understanding needs expressed in natural language ("Build me a house," "Make a pickaxe"), to planning an action scheme ("collect resources," "clear the work area," "build/craft"), to finally generating and executing a complete action script that achieves the goal, it handles tasks in an orderly manner, enriching the player's game experience.
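As a toy sketch of that intent -> plan -> script pipeline: the bot methods move_to and do follow the names quoted earlier in this article, while the planner and its plan table are invented purely for illustration.

```python
# Toy intent -> plan -> script pipeline for driving a game bot.
# Only move_to/do echo method names from the article; the rest is assumed.

def plan_steps(intent: str) -> list[tuple[str, str]]:
    """Map a natural-language intent to an ordered list of (action, target)."""
    plans = {
        "build me a house": [
            ("collect", "wood"),
            ("clear", "build_site"),
            ("build", "house"),
        ],
        "make a pickaxe": [
            ("collect", "wood"),
            ("collect", "stone"),
            ("craft", "pickaxe"),
        ],
    }
    return plans.get(intent.lower().strip(), [])

def execute(bot, intent: str) -> None:
    # Walk the plan in order, driving the bot through its exposed API.
    for action, target in plan_steps(intent):
        bot.move_to(target)  # navigate to where the step happens
        bot.do(action)       # perform the step there

# Usage: execute(bot, "Make a pickaxe")
```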
Case 2: Tactical evolution driven by group memory
In "search, attack, retreat" games, if a player always follows a habitual route and is observed multiple times, the enemy group can gradually strengthen this "group memory."
During the subsequent map-construction and deployment phase, the system adjusts the overall strategy accordingly, for example by re-weighting deployments toward the player's habitual route.
At the same time, behavior-expression rules can be added to enhance immersion: when a player is successfully found in a hot zone, a bot can shout "Got you!", and when it encounters a player in an unexpected area, it can shout "Wrong prediction!" Simple squad combat strategies can also be configured, such as a frontline charge with rear-line cover. This kind of behavior is not a single-point script but the result of dynamic generation by group strategies; a short sketch of such a group memory follows.
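Here is an illustrative sketch of how such a "group memory" could drive redeployment. The data structure, threshold, and zone names are assumptions for clarity, not CodeBrain-1's actual implementation.

```python
# Toy "group memory" for tactical evolution: repeated sightings of a player
# along a route become shared knowledge that re-weights the next deployment.

from collections import Counter

class GroupMemory:
    """Shared memory of where players have been observed."""
    def __init__(self, hot_threshold: int = 3):
        self.sightings = Counter()
        self.hot_threshold = hot_threshold

    def observe(self, zone: str) -> None:
        self.sightings[zone] += 1

    def hot_zones(self) -> list[str]:
        # Zones observed often enough become "habitual routes" worth covering.
        return [z for z, n in self.sightings.items() if n >= self.hot_threshold]

def redeploy(memory: GroupMemory, squads: list[str]) -> dict[str, str]:
    # Concentrate squads on hot zones during the next build phase;
    # fall back to default patrols when no pattern has emerged yet.
    hot = memory.hot_zones()
    return {
        squad: (hot[i % len(hot)] if hot else "patrol_default")
        for i, squad in enumerate(squads)
    }

# Usage: after several observe("river_route") calls, redeploy() concentrates
# squads along that route; reaction lines ("Got you!") can then be attached
# to sighting events in hot zones.
```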
Why are AI giants competing on Terminal-Bench 2.0?
Terminal-Bench is an open-source benchmark jointly developed by Stanford University and the Laude Institute, and is recognized as the "gold standard" for the end-to-end execution ability of AI agents in a real command-line interface (CLI) environment.
Unlike code-generation tests that remain purely theoretical, its strictness lies in:
- Closed-loop real-world environment: in an isolated Docker container, the AI must complete compilation, debugging, training, and deployment in a real Linux ecosystem, just like a human expert.
- High-pressure long-horizon tasks: 89 in-depth scenarios spanning software engineering and scientific computing demand an extremely long logical span and completely rule out simple "pattern matching."
- Zero-tolerance verification: a 0/1 pass criterion; a task counts only when the expected deliverables (such as fixed code or a running service) are produced, with no "fuzzy scores" (a toy illustration of this check follows the list).
- The "ceiling effect" of 2.0: the upgraded version 2.0 has raised the bar significantly. The solve rates of the world's top models currently struggle to exceed 65%, making it the "deep-water zone" for large models handling system-level complex tasks.
CodeBrain-1's second-place global ranking in its first appearance speaks volumes about its high value.
Taking the GPT series as an example: although top models have strong logical reasoning chains, they often produce long execution chains due to "over-thinking."
CodeBrain-1 is not an AI that "talks better" but an execution-oriented brain, built of code, that continuously adjusts plans and strategies. It skillfully plays the roles of "scheduling center" and "efficiency calibrator": it keeps the model responding rapidly in routine operations and activates deep thinking only when critical errors occur.
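A hedged sketch of that escalation policy, assuming a hypothetical llm client with a mode switch; the error classifier below is a deliberately crude stand-in.

```python
# Sketch of an "efficiency calibrator": fast mode for routine steps,
# deliberate reasoning only on critical errors. The mode names and the
# classifier are assumptions, not CodeBrain-1's documented behavior.

def is_critical(error: str | None) -> bool:
    # Toy classifier: treat runtime/test failures as critical, all else routine.
    return error is not None and any(
        key in error for key in ("Traceback", "FAILED", "compile error")
    )

def run_step(llm, step: str, last_error: str | None = None) -> str:
    if is_critical(last_error):
        # Critical failure: spend tokens on deep reasoning plus error context.
        return llm.generate(step, mode="deliberate", context=last_error)
    # Routine operation: fast, cheap response keeps the execution chain short.
    return llm.generate(step, mode="fast")
```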
This precise control of the base model is precisely the core variable that differentiates the commercial implementation of large models.
Robust closed-loop error correction (Error Recovery), efficient task decomposition (Sub-goal Decomposition), and accurate environmental perception are still the "necessary path for model implementation" in the commercial landscape of AGI.
It is not only about the accuracy of task decomposition but also about the resilience to correct errors and survive in a closed - loop environment.
Sam Altman's declaration after the release of GPT-5.3-Codex also confirms this trend: Codex has evolved from a single code-review tool into an "all-around agent" that can perform all of a professional's computer operations across the entire life cycle.
In OpenAI's blueprint, the model and the framework are evolving into a deeply integrated "intelligent package."
Even with giants all around, there is still a huge commercial dividend for excellent engineering frameworks in the deep-water zones of vertical industries.
Whether it is a system-level Agent framework or a powerful developer-efficiency tool, these "closer-to-the-user" touchpoints all have the potential for explosive growth.
As a Chinese startup team, Feeling AI was able to complete deep integration immediately after the release of OpenAI's cutting-edge model and achieve globally leading results. This is not only a victory of engineering response speed but also strong proof that Chinese AI teams have claimed a high point in global engineering collaboration.
On the hardcore track of Terminal-Bench 2.0, known for its "real-world environment and long-horizon evolution," taking second place globally right behind OpenAI carries obvious symbolic weight: Chinese startup teams have taken the lead in crossing the gap from "conversational toys" to "productivity tools" for agents and have seized a leading position on this strategic high ground.