Stanford's latest research: an AI's context matters more than its parameters. No retraining or further fine-tuning required.
Recently, Stanford University collaborated with SambaNova Systems to publish the paper "Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models."
The paper proposes a framework called ACE (Agentic Context Engineering), which enables AI to achieve self-improvement without retraining its weights.
Paper link: http://arxiv.org/abs/2510.04618v1
The paper's core claim is that a large model's capability is determined not only by its parameters but, to an even greater degree, by the quality of its context. In other words, whoever constructs the best context can make the model smarter.
Concretely, ACE moves the model away from relying on "static prompts" and toward a dynamic, structured, evolvable "knowledge playbook."
This playbook records the strategies, rules, templates, and corrections the model accumulates while executing tasks. Every success or failure is converted into an "incremental update" (a delta).
Unlike traditional prompt rewriting, ACE improves the playbook continuously through small, safe updates rather than starting over from scratch.
This mechanism means that AI can learn, remember, and improve during operation without any parameter fine-tuning.
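To make the delta mechanism concrete, here is a minimal Python sketch of what a playbook entry and an incremental update might look like. The schema below (PlaybookEntry, Delta, apply_delta, and the feedback counters) is our own illustration of the idea, not the paper's actual data model:

```python
from dataclasses import dataclass, field


@dataclass
class PlaybookEntry:
    """One itemized lesson in the playbook (hypothetical schema)."""
    entry_id: str
    content: str            # e.g. "The orders API caps results at 50 rows; always paginate."
    helpful_count: int = 0  # times this entry contributed to a success
    harmful_count: int = 0  # times it was implicated in a failure


@dataclass
class Delta:
    """An incremental update: new entries plus feedback on existing ones."""
    new_entries: list[PlaybookEntry] = field(default_factory=list)
    feedback: dict[str, int] = field(default_factory=dict)  # entry_id -> +1 or -1


def apply_delta(playbook: dict[str, PlaybookEntry], delta: Delta) -> None:
    """Merge a delta into the playbook in place; nothing is rewritten wholesale."""
    for entry in delta.new_entries:
        playbook.setdefault(entry.entry_id, entry)
    for entry_id, vote in delta.feedback.items():
        if entry_id in playbook:
            if vote > 0:
                playbook[entry_id].helpful_count += 1
            else:
                playbook[entry_id].harmful_count += 1
```

Because each delta only appends entries or adjusts their counters, an update can never erase what earlier updates built up.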
ACE Framework
The researchers pointed out that this mechanism avoids two fatal problems: one is brevity bias, where optimizing for concision loses key details; the other is context collapse, where a wholesale rewrite damages the accumulated knowledge.
The paper gives an example: in one experiment, an AI agent accumulated 18,000 tokens of context and performed well. But when the model tried to "summarize and compress" that context, the playbook shrank to just 122 tokens and accuracy immediately fell to 57.1%.
The researchers put it bluntly: "The model is good at using knowledge but not good at organizing it. One bad rewrite can destroy everything that has been accumulated."
The paper claims that ACE eliminates the structural risk of this "self-destructive learning."
Caption: The ACE framework significantly outperforms other methods across three task types (agent operations, domain knowledge, and numerical reasoning), with the largest gains in accuracy.
Three-Role Collaboration: Generation, Reflection, and Curation
The ACE system is built on a minimalist philosophy: Don't rewrite knowledge; manage it.
The system is decomposed into three complementary roles.
The first is the Generator. It executes tasks, interacts with the environment, and produces reasoning traces, code, or action sequences.
The second is the Reflector. It analyzes the Generator's trajectory, identifies why it succeeded or failed, and extracts "actionable lessons." These feedback signals may come from code errors, execution results, or external labels.
The third is the Curator. It distills those lessons into structured entries (delta contexts) and merges them into the main playbook through deterministic rules, with no language-model judgment in the merge step.
This cycle of action, reflection, and integration forms ACE's closed learning loop, sketched in code below.
Each update touches only local entries, never the text as a whole. This local, incremental mechanism lets the knowledge base keep expanding without collapsing.
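Reusing the PlaybookEntry/Delta schema and the apply_delta helper from the earlier sketch, one pass through the loop might look like this; the generate, reflect, and curate callables stand in for LLM-backed components and are hypothetical:

```python
from typing import Any, Callable


def ace_step(
    task: Any,
    playbook: dict[str, PlaybookEntry],
    generate: Callable[[Any, dict], Any],   # Generator: run the task with the playbook as context
    reflect: Callable[[Any], list[str]],    # Reflector: distill lessons from the trajectory
    curate: Callable[[list[str]], Delta],   # Curator: package lessons as a structured delta
) -> Any:
    """One ACE cycle: act, reflect, integrate."""
    trajectory = generate(task, playbook)   # 1. execute with the current context
    lessons = reflect(trajectory)           # 2. extract actionable lessons
    delta = curate(lessons)                 # 3. structure them as a delta...
    apply_delta(playbook, delta)            # ...and merge it deterministically, no rewrite
    return trajectory
```

Note that only the final merge is plain deterministic code; the three roles themselves would each be backed by a language model.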
The playbook itself uses an itemized structure: strategy rules, API call templates, debugging notes, fixes for common errors, and so on. Each entry carries metadata such as a usage count and positive/negative feedback tallies.
Based on these records, the Reflector judges which rules are effective and which are useless, and the Curator then modifies or deletes them accordingly.
The paper claims this approach lets the AI's knowledge "evolve like a Git repository": growing safely, pruned carefully, and fully traceable.
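Continuing the same hypothetical schema, the Curator's deterministic cleanup could use those per-entry counters as in the sketch below; the thresholds are invented for illustration and are not prescribed by the paper:

```python
def prune_playbook(
    playbook: dict[str, PlaybookEntry],
    min_uses: int = 3,            # only judge entries with enough of a track record
    max_harm_ratio: float = 0.5,  # retire entries that hurt more often than they help
) -> list[str]:
    """Deterministically retire entries whose feedback record has gone bad."""
    removed = []
    for entry_id, entry in list(playbook.items()):
        uses = entry.helpful_count + entry.harmful_count
        if uses >= min_uses and entry.harmful_count / uses > max_harm_ratio:
            del playbook[entry_id]
            removed.append(entry_id)  # keep an audit trail, Git-style
    return removed
```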
The researchers emphasized that the complexity of ACE is not a burden but a structured safety mechanism, which exchanges a small system overhead for the stable accumulation of knowledge.
Small Models Punching Above Their Weight: DeepSeek Beats GPT-4.1
On the complex AppWorld agent tasks, the ACE framework delivers an average performance gain of +10.6% and reduces adaptation latency by 86.9%.
The research team specifically mentioned that this improvement does not rely on a larger model but comes from better context management.
A typical example is DeepSeek V3.1. Despite having fewer parameters than GPT-4.1, under the ACE framework it matches the GPT-4.1-based agent (IBM CUGA) on the AppWorld benchmark and even surpasses it on the more difficult test sets.
The researchers pointed out that this result shows that "context engineering" has become a new equalizer for computing power.