
The first research result from Yao Shunyu's team at Tencent has just been released, revealing a real bottleneck of large models.

Zhidx, 2026-02-03 22:25
The Tencent Hunyuan technical blog makes its first public appearance.

According to a February 3 report from Zhidx, the official Tencent Hunyuan research website has just gone live, publishing the latest work from Yao Shunyu's team: CL-bench, a benchmark designed specifically to evaluate whether large language models can learn new knowledge from context and apply it correctly.

This is the first research result the team has released since Yao Shunyu joined Tencent Hunyuan as Chief AI Scientist, and it is also the first public post on the Tencent Hunyuan technical blog.

Tencent Hunyuan technical blog and the acknowledgement section

The key difference between large models and humans when solving problems is that large models can rely only on static memory from the pre-training stage, whereas humans adapt to the situation at hand in real time. Through testing, the Tencent Hunyuan research team found that almost no current SOTA model can truly learn from context: even the best performer, GPT-5.1 (high), achieves a task success rate of only 23.7%.

Based on this, the team created CL-bench with a single core goal: when solving each task, the model must learn new knowledge from the context, knowledge that does not exist in its pre-training data, and apply it correctly.

Address of the Tencent Hunyuan technical blog:

https://hy.tencent.com/research

Project homepage:

www.clbench.com

01. Freeing large models from rote memorization: a new benchmark with 500 complex contexts

Large language models have advanced rapidly in recent years. They can solve Olympiad-level problems, work through complex programming logic, and even pass professional qualification exams that take humans years of hard study. But behind this lies a key threshold: a model that scores perfectly on an exam may still not be competent at real-world work.

Humans learn from their immediate environment in real time while performing a task. Large language models, by contrast, rely mainly on "parameterized knowledge", the static memory compressed into model weights during pre-training. During inference, the model mostly recalls this stored internal knowledge rather than actively absorbing the new information in its input.

As a result, today's models are optimized to reason well about things they already "know", while what users need is for the model to solve tasks that depend on messy, dynamically changing context.

Hoping to bridge this gap and fundamentally change the direction of model optimization, the Hunyuan researchers constructed CL-bench, a benchmark built specifically to evaluate whether large language models can learn new knowledge from context and apply it correctly.

Paradigm shift of large language models

CL-bench comprises 500 complex contexts, 1,899 tasks, and 31,607 expert-created verification criteria. Its demand on the model is that, to solve each task, the model must learn new knowledge from the context that does not exist in its pre-training data and apply that knowledge correctly.

The knowledge the model must learn is very broad, spanning new domain knowledge, unfamiliar rule systems, complex product workflows, and even laws or conclusions that must be induced from experimental data.

All of this knowledge is either newly constructed from scratch by domain experts or drawn from niche, long-tail sources unlikely to appear in the training data of current cutting-edge models. The model therefore cannot solve tasks by recalling static parameterized knowledge; it must learn from and apply the provided context.

Specifically, CL-bench covers four broad real-world context-learning scenarios:

Context classification system of CL-bench

Domain knowledge reasoning: The context provides specific domain knowledge, such as a fictional legal system, innovative financial instruments, or niche professional knowledge. The model needs to use this knowledge to reason and solve specific problems.

Rule system application: The context provides a newly defined formal system, such as new game mechanics, mathematical formal systems, programming syntax, or technical standards. The model must understand and apply these rules to perform tasks.

Procedural task execution: The context provides a complex process system, such as workflows, product manuals, and operating instructions. The model must understand and apply this procedural information to complete tasks.

Empirical discovery and simulation: The context provides experimental data, observation records, or a simulation environment within a complex system. Unlike the previous categories, which involve deductive reasoning, this category centers on induction: the model must discover the underlying laws or conclusions in the data and apply them to solve tasks.

Examples of CL-bench. Solving these tasks requires large language models to learn from the provided context

Together, these categories cover most of the deductive and inductive reasoning tasks common in real-world work and measure a model's context-learning ability.
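To make the structure concrete, here is a minimal sketch of what a single CL-bench record might look like, using only the figures reported in this article (four categories, expert-written criteria, sequence-dependent tasks). All class and field names are assumptions for illustration; the actual schema has not been published.

```python
# Hypothetical sketch of one CL-bench record; names are illustrative only.
from dataclasses import dataclass, field

@dataclass
class Task:
    prompt: str                        # the question posed against the context
    criteria: list[str]                # expert-written verification criteria (avg. 16.6 per task)
    depends_on_previous: bool = False  # 51.1% of tasks are sequence-dependent

@dataclass
class ContextCase:
    category: str   # one of the four scenarios: "domain_knowledge",
                    # "rule_system", "procedural", "empirical_discovery"
    context: str    # self-contained material the model must learn from
    tasks: list[Task] = field(default_factory=list)

example = ContextCase(
    category="rule_system",
    context="<full rules of a newly invented board game>",
    tasks=[
        Task(
            prompt="Given the position below, list all legal moves.",
            criteria=[
                "Every listed move is legal under the stated rules",
                "No legal move is omitted",
            ],
        )
    ],
)
```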

02. Models solve only 17.2% of tasks on average; five key conclusions emerge

The researchers evaluated ten mainstream large language models on CL-bench.

On average, the models solved only 17.2% of the tasks; the best performer, GPT-5.1 (High), solved 23.7%.

In other words, although each context contains all the information needed to solve its tasks, the models failed on most of them. This indicates that almost no current SOTA model can truly learn from context.

Task solution rates of ten cutting-edge models on CL-bench

The Hunyuan research team drew several key conclusions:

1. Ignoring or misusing context is the main reason for failure.

Many errors stem not from missing information but from the model ignoring key details in the context or misapplying them. In many cases the model solves tasks using only the static knowledge acquired during pre-training; even when the context clearly defines new rules, concepts, or procedures, the model fails to learn and use them.

Distribution of error types of each model

2. Long-context reasoning and instruction following are necessary but not sufficient.

Case studies show that models that struggle to track dependencies across long contexts or to follow constraints precisely tend to perform worse. Yet even models that handle long inputs and follow instructions reliably still fail many tasks: context learning demands far more than long-context understanding and instruction following.

3. Inductive reasoning from experimental data and environmental simulations is more difficult than deductive application.

Deductive tasks require the model to apply rules and procedures given explicitly in the context, whereas empirical-discovery and environment-simulation tasks demand inductive reasoning: summarizing laws from data or exploring a virtual environment. Model performance on these tasks is markedly worse, with solution rates usually below 10% and highly variable results, showing that discovering laws is far harder than applying rules.

Comparison of GPT-5.1's performance in each sub-category under high/low reasoning-intensity settings

4. Higher reasoning intensity usually improves context learning.

For some models, increasing reasoning intensity improves performance by letting the model digest complex contexts more deeply; GPT-5.1's performance on management and experimental-data tasks, for example, rose by about 6%. For other models the gains are limited, and performance can even decline, indicating that more reasoning alone is not enough: the model must also correctly absorb and organize the information in the context.

Trend of the model's context learning performance under different input lengths

5. Context-learning difficulty rises with context length, but short contexts can also be complex.

Longer contexts generally make tasks harder for every model, confirming that long-context processing remains a key bottleneck. Yet even short contexts can be highly challenging when they are information-dense or carry implicit rules, intricate dependencies, or strict constraints, which shows that the difficulty of context learning stems not only from length but also from complexity.

CL-bench explains well why large language models so often fail in real-world scenarios: even with context engineering and all the necessary context prepared for the model, it may still fail. Simply providing context is not enough if the model cannot truly learn from it. Context learning, a fundamental learning ability, has been largely overlooked.

03. All contexts are self-contained, and test tasks use a contamination-free design

Each context in CL-bench is completely self-contained: all the information needed to solve its tasks is provided explicitly in the context itself, with no external retrieval required and no hidden assumptions allowed.

Solving tasks in CL-bench requires the model to learn new knowledge from the corresponding context

To ensure that measured performance reflects genuine context learning rather than memorization or data leakage, CL-bench adopts a contamination-free design:

Fictional creation: Experts create completely fictional content, such as designing a complete legal system for a fictional country, including novel case precedents and legal principles, or creating a new programming language with unique syntax and semantics.

Modification of existing content: Experts modify real - world content to create variants, such as changing historical events, altering scientific and mathematical definitions, or modifying technical documents and standards.

Integration of niche and emerging content: Experts incorporate niche or recently published content that is severely under-represented in pre-training data, such as cutting-edge research findings, newly released product manuals or technical documents, and specialized domain knowledge.

Without any context provided, GPT-5.1 (High) solves less than 1% of the tasks, further evidence that the data is contamination-free: if the model does not learn from the context, it can hardly solve these tasks at all.
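As a rough illustration of that probe, the sketch below scores each task with and without its context; a large gap between the two rates (here, 23.7% versus under 1%) is evidence against contamination. The helper names ask_model and meets are hypothetical stand-ins for the model call and the criterion judge, not a published CL-bench API.

```python
# Hypothetical contamination probe: compare solve rates with and without context.
def ask_model(model: str, prompt: str) -> str:
    raise NotImplementedError  # stub: call the LLM under test here

def meets(answer: str, criterion: str) -> bool:
    raise NotImplementedError  # stub: judge one expert-written criterion here

def solve_rate(cases, model: str, with_context: bool) -> float:
    solved = total = 0
    for case in cases:
        for task in case.tasks:
            # strip the context to test whether pre-training memory alone suffices
            prompt = (case.context + "\n\n" + task.prompt) if with_context else task.prompt
            answer = ask_model(model, prompt)
            # a task counts as solved only if every criterion is met
            solved += all(meets(answer, c) for c in task.criteria)
            total += 1
    return solved / total
```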

In addition, CL-bench is designed for high complexity and sequential dependence: 51.1% of the tasks are sequence-dependent, meaning that solving a later task depends on the results of earlier interactions. This multi-turn design further raises task difficulty.

On average, domain experts spend about 20 hours annotating each context to ensure the quality and depth of task construction.

Meanwhile, every task in CL-bench is fully verifiable: each context is associated with an average of 63.2 verification criteria, and each task with an average of 16.6.
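Putting the multi-turn and criteria-based design together, an evaluation loop consistent with these numbers might look like the hedged sketch below; this is not the team's actual harness, and chat and meets are again hypothetical placeholders.

```python
# Hypothetical sequence-dependent evaluation over one context.
def chat(model: str, messages: list) -> str:
    raise NotImplementedError  # stub: multi-turn call to the model under test

def meets(answer: str, criterion: str) -> bool:
    raise NotImplementedError  # stub: judge one verification criterion

def run_context(case, model: str) -> list:
    # The model first receives the full self-contained context, then the tasks
    # in order; later turns see the whole transcript, so for sequence-dependent
    # tasks an early mistake can cascade.
    messages = [{"role": "user", "content": case.context}]
    results = []
    for task in case.tasks:
        messages.append({"role": "user", "content": task.prompt})
        answer = chat(model, messages)
        messages.append({"role": "assistant", "content": answer})
        # solved only if all of the task's criteria are met
        results.append(all(meets(answer, c) for c in task.criteria))
    return results
```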

04. Conclusion: How large models memorize will become the core theme in 2026

The Hunyuan technical blog also outlines the team's next steps, including how to improve models' context-learning ability and how to make the knowledge a large model learns from context persistent.

If context learning can be improved the way earlier abilities were, the role of humans in AI systems will change: humans will no longer be primarily data providers but context providers, and the focus of competition will shift from "who can train the better model" to "who can supply the richest, most relevant context for the task".

They believe that how large models memorize is likely to become another core theme in 2026. Fully unleashing the potential of large language models may require new architectures and new optimization methods to decide "what to retain".

In the future, once large models' context learning and memory become reliable, they may achieve autonomous learning: preparing context on their own, learning from it, and consolidating what they learn.
