Is fine-tuning dead? Enter Agentic Context Engineering, which lets models evolve without fine-tuning.
What made an AI automation architect exclaim that "fine-tuning is dead"?
A recent paper from Stanford University, SambaNova, and UC Berkeley has sparked extensive discussion. It proposes a technique called Agentic Context Engineering (ACE) that enables language models to improve themselves without fine-tuning!
- Paper title: Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models
- Paper link: https://www.arxiv.org/abs/2510.04618
It all starts with context adaptation
Contemporary AI systems based on large language models (LLMs), such as LLM agents and composite AI systems, increasingly rely on context adaptation.
Specifically, context adaptation improves model performance by adding clearer instructions, structured reasoning steps, or domain-specific input formats to the model's input after training. This is fundamentally different from fine-tuning, which modifies the model's parameters directly.
We know that context forms the basis of many AI system components, including: system prompts that guide downstream tasks, memory mechanisms that carry past facts and experiences, and factual evidence that reduces hallucinations and supplements knowledge.
Compared with parameter updates, adapting through context has several core advantages: context is more interpretable for users and developers; it can quickly integrate new knowledge at runtime; and it can be shared among multiple models or modules in a composite system. Meanwhile, progress in long-context language models and efficient inference mechanisms (such as KV cache reuse) has also made context-based methods more practical. Context adaptation is therefore becoming a core paradigm for building high-performance, scalable, and self-improving AI systems.
However, existing context adaptation methods still have two major limitations.
The first is "brevity bias": many prompt optimizers favor concise, universal instructions at the expense of accumulating knowledge in full. GEPA, for example, treats brevity as an advantage, but this abstraction can omit domain heuristics, tool-usage guidelines, or common error patterns that matter in practice. While such optimization objectives may score well on some metrics, they often fail to capture the detailed strategies that agents and knowledge-intensive applications require.
The second is "context collapse": relying on an LLM to rewrite the entire prompt tends to degrade it over time into shorter, more ambiguous summaries, causing a sharp drop in performance (see Figure 2 of the paper). In tasks such as interactive agents, domain-specific programming, and financial or legal analysis, system performance depends on retaining detailed, task-relevant knowledge rather than compressing it away.
As the reliability requirements of agents and knowledge-intensive reasoning increase, recent research has gradually shifted towards building "information-saturated" contexts, which means leveraging the progress of long-context language models to accommodate more potentially useful information.
However, the joint team from Stanford University, SambaNova, and UC Berkeley argues that context should not be a short summary but a comprehensive, dynamically evolving "playbook": detailed, inclusive, and rich in domain insights. Unlike humans, LLMs perform better when given long, detailed contexts and can extract the key information on their own. Rather than compressing domain heuristics and strategies away, ACE retains them and lets the model decide at inference time which information matters most.
Based on this insight, Agentic Context Engineering (ACE) emerged.
Agentic Context Engineering (ACE)
The ACE (Agentic Context Engineering) framework proposed by the team achieves scalable, efficient context adaptation and applies to both offline scenarios (such as system prompt optimization) and online scenarios (such as test-time memory adaptation).
Unlike previous methods that distill knowledge into short summaries or static instructions, ACE treats context as an evolving playbook that continuously accumulates, distills, and organizes strategies.
Based on the agentic architecture of Dynamic Cheatsheet (see arXiv:2504.07952), ACE introduces three collaborative roles:
- Generator: Generates reasoning trajectories;
- Reflector: Distills specific insights from successes and errors;
- Curator: Integrates these insights into structured context updates.
This design mimics the human learning loop of "experiment, reflect, integrate" and avoids the bottleneck of having a single model perform every function.
To address the brevity bias and context collapse problems mentioned above, ACE introduces three key innovations:
- Specialized Reflector module: Decouples evaluation and insight extraction from curation, improving context quality and downstream performance;
- Incremental delta updates: Replaces wholesale rewrites with localized edits, significantly reducing latency and compute overhead;
- Grow-and-refine mechanism: Suppresses redundancy while the context keeps expanding, enabling steady-state evolution of the context.
In the workflow, the Generator first produces reasoning trajectories for new tasks, surfacing effective strategies and common pitfalls; the Reflector analyzes these trajectories and extracts lessons, optionally refining them over multiple iterations; the Curator then condenses these lessons into compact delta entries and merges them into the existing context through a lightweight, non-LLM merge step.
Since the update entries are localized, multiple increments can be merged in parallel, enabling batch adaptation and scalability. ACE also supports multi-epoch adaptation, allowing the same task to be revisited multiple times to continuously strengthen the context.
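To make the workflow concrete, here is a minimal Python sketch of one adaptation step. The `llm(prompt) -> str` callable, the prompts, and all names are illustrative assumptions on our part, not the paper's actual implementation:

```python
def ace_step(task: str, playbook: list[str], llm) -> list[str]:
    """One ACE adaptation step (illustrative sketch, not the paper's code)."""
    context = "\n".join(playbook)

    # Generator: produce a reasoning trajectory for the new task,
    # conditioned on the current context (the evolving playbook).
    trajectory = llm(f"Context:\n{context}\n\nTask: {task}\nSolve step by step.")

    # Reflector: distill concrete lessons from successes and errors in the
    # trajectory (in ACE this step may iterate several times).
    insights = llm(f"Trajectory:\n{trajectory}\n\nExtract reusable lessons, one per line.")

    # Curator: turn the lessons into compact delta entries and merge them
    # with deterministic, non-LLM logic: a localized append, not a rewrite.
    delta = [line.strip() for line in insights.splitlines() if line.strip()]
    return playbook + delta
```

Because each step only appends or touches local entries, independent deltas can be merged in parallel, which is what makes the batch adaptation described above feasible.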
Incremental Delta update
The core design concept of ACE is that context is represented as a structured set of entries (bullets) rather than a single monolithic prompt.
Each entry contains two parts:
- Metadata: A unique identifier and counters tracking how often the entry proved helpful or harmful;
- Content: Such as reusable strategies, domain concepts, or common error patterns.
When solving new problems, the Generator marks which entries are helpful or misleading, providing a basis for the Reflector to improve.
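As an illustration, one such entry could be modeled as follows. The field names are our own assumption; the paper specifies only an identifier, helpful/harmful counters, and free-form content:

```python
from dataclasses import dataclass, field
from itertools import count

_next_id = count()

@dataclass
class Bullet:
    """One context entry: metadata plus reusable content (illustrative sketch)."""
    content: str                 # e.g. a strategy, domain concept, or error pattern
    id: int = field(default_factory=lambda: next(_next_id))  # unique identifier
    helpful: int = 0             # bumped when the Generator marks the entry helpful
    harmful: int = 0             # bumped when the entry misleads the Generator

    def mark(self, was_helpful: bool) -> None:
        # Feedback from the Generator; the Reflector reads these counters later.
        if was_helpful:
            self.helpful += 1
        else:
            self.harmful += 1
```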
This entry-based design brings three characteristics:
- Localization: Only updates relevant entries;
- Fine-grained retrieval: The Generator can focus on the most relevant knowledge;
- Incremental adaptation: Can efficiently perform merging, pruning, and deduplication during reasoning.
ACE does not rewrite the entire context but generates compact incremental contexts (delta contexts): a small set of candidate entries refined by the Reflector and integrated by the Curator.
This approach avoids the computational cost and latency of wholesale rewrites while preserving old knowledge and continuously absorbing new insights. As the context grows, the mechanism provides the scalability needed for long-horizon or knowledge-dense tasks.
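Here is a sketch of what such a non-LLM merge could look like, reusing the `Bullet` class above. Keying entries by id is our assumption; the paper states only that the merge is lightweight and deterministic:

```python
def merge_delta(context: dict[int, Bullet], delta: list[Bullet]) -> dict[int, Bullet]:
    """Merge a compact delta into the existing context (illustrative sketch)."""
    merged = dict(context)                    # old knowledge is preserved
    for bullet in delta:
        if bullet.id in merged:               # existing entry: in-place metadata update
            merged[bullet.id].helpful += bullet.helpful
            merged[bullet.id].harmful += bullet.harmful
        else:                                 # new entry: localized insert
            merged[bullet.id] = bullet
    return merged
```

Note that merges of deltas touching different entries commute, which is why multiple increments can be combined in parallel.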
Grow-and-Refine
While the context grows continuously, ACE keeps it compact and relevant through periodic or lazily triggered distillation.
In the Grow-and-Refine process, new entries are appended to the context, while existing entries are updated in place via their metadata (for example, incrementing the helpful counter).
A deduplication step then removes redundancy by comparing entries via semantic-embedding similarity.
This process can run proactively after each incremental update or be triggered lazily once the context window is exceeded, depending on latency and accuracy requirements.
The incremental update and Grow-and-Refine mechanisms together maintain the dynamic scalability and high relevance of the context.
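A minimal sketch of that deduplication pass, assuming a generic `embed(text)` function (any sentence-embedding model would do) and a similarity threshold of our own choosing:

```python
import numpy as np

def deduplicate(bullets: list[Bullet], embed, threshold: float = 0.9) -> list[Bullet]:
    """Drop near-duplicate entries by cosine similarity (illustrative sketch)."""
    kept, kept_vecs = [], []
    for b in bullets:
        v = np.asarray(embed(b.content), dtype=float)
        v /= np.linalg.norm(v)                # normalize so the dot product is cosine similarity
        # Skip the entry if it is a near-duplicate of one already kept.
        if any(float(v @ u) > threshold for u in kept_vecs):
            continue
        kept.append(b)
        kept_vecs.append(v)
    return kept
```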
How effective is ACE?
The team conducted experiments to verify the newly proposed method.
Specifically, they conducted experiments on two types of tasks: agent tasks and domain-specific tasks.
- Agent tasks used the AppWorld benchmark, which covers complex behaviors such as multi-round reasoning, tool invocation, and environment interaction. It includes scenarios of different difficulties (normal and challenge modes) and has a public leaderboard to evaluate the real performance of agents.
- Domain-specific tasks focused on financial analysis, using the FiNER and Formula datasets: the former requires identifying fine-grained entity types in XBRL financial reports, and the latter examines the model's numerical reasoning and calculation ability in structured financial reports.
The baseline methods for comparison include the following:
- ICL (In-Context Learning): Achieves few-shot learning by providing example demonstrations in the input;
- MIPROv2 and GEPA: Two mainstream prompt optimization algorithms based on Bayesian optimization and reflective evolutionary strategies respectively;
- Dynamic Cheatsheet (DC): A memory adaptation mechanism during testing that can accumulate reusable strategies and knowledge.
In comparison, under the same base model and operating conditions, ACE delivered higher accuracy, faster adaptation, and lower computational cost through its "generate, reflect, integrate" agentic context engineering framework.
ACE performed strongly across the experiments; the figure below summarizes its overall results, which show a clear advantage over the baselines.
First, ACE can indeed achieve high-performance, self-improving agents.
By dynamically optimizing the input context, ACE enables agents to improve themselves. On the AppWorld benchmark, without any labeled data, ACE improved performance by up to 17.1% from execution feedback alone, bringing small open-source models close to the strongest commercial systems.
The figure below shows part of a context generated by ACE on the AppWorld benchmark. It contains detailed, domain-specific insights as well as directly usable tools and code, forming a complete "playbook" for LLM applications.
At the same time, ACE can also significantly improve performance on domain-specific tasks: in complex financial reasoning tasks, ACE improved the average performance by 8.6% by building a "playbook" rich in domain knowledge.
The team also verified the effectiveness of their new design through ablation experiments, and the results showed that components such as the Reflector and multi-round distillation are crucial for performance improvement.
Finally, the team analyzed ACE's cost and latency and found significant reductions in both: with incremental updates and a lightweight merging mechanism, ACE cut adaptation latency by an average of 86.9% while also lowering token-generation cost.
As for whether ACE really makes fine-tuning "dead", that is for you, the reader, to judge; the work has also drawn some criticism online.
Conclusion
The team summarized: "Long context ≠ higher serving cost." Although the contexts ACE generates are longer than those of methods such as GEPA, this does not lead to a linear increase in inference cost or memory usage.
Modern serving infrastructure optimizes long-context workloads through mechanisms such as KV cache reuse, compression, and offloading, allowing frequently used context fragments to be cached rather than recomputed. As system-level optimizations keep improving, the real deployment cost of long-context methods such as ACE will fall further.
At the same time, the team also analyzed the implications of this research for online and continuous learning.
Online and continuous learning are important directions for coping with distribution shift and the limited nature of training data. ACE offers a flexible, efficient alternative to traditional fine-tuning: updating context is usually cheaper than updating model parameters, is interpretable, and may enable selective unlearning, which is valuable for privacy protection, compliance, and removing incorrect or outdated information.
The team believes that ACE can become one of the core mechanisms driving continuous learning and self-improvement in language models.