
New breakthrough in general self-evolving agents: 30k context is enough, and token consumption drops by nearly 90%.

[Account deleted] 2026-04-28 19:37
Method: Maximize the density of contextual information.

Currently, the performance of long-horizon Agents is fundamentally limited by context, facing problems such as "context explosion" and "difficulty accumulating cross-task experience."

Specifically, tool descriptions, retrieved memories, and raw environment feedback accumulate continuously over multi-step interactions, pushing the information actually needed for decision-making out of the model's effective attention span. Meanwhile, the successful path of each task is forgotten once the session ends, so similar tasks must be explored from scratch when they arrive. The result: token consumption grows linearly with the number of tasks while capability stays flat.

To address these challenges, the A3 Laboratory (Advantage AI Agent Lab) team proposed GenericAgent (GA), a general self-evolving LLM Agent system constructed around the single principle of "maximizing the density of context information."

Paper link: https://arxiv.org/pdf/2604.17091

It is reported that in terms of task completion rate, tool usage efficiency, memory effectiveness, self-evolving ability, and web browsing, GA's performance exceeds that of mainstream Agent systems. Meanwhile, it consumes fewer tokens and has fewer interaction rounds, and can continuously evolve over time.

For example, on Lifelong AgentBench, GA achieved a 100% completion rate with 222k input tokens (only 27.7% of Claude Code and 15.5% of OpenClaw). In the 9-round repeated GitHub research task, the token consumption decreased by 89.6%, and the number of calls converged from 32 to 5.

I. Maximizing the density of context information

According to the paper, the research team decomposed context quality into three constraints: Completeness, Conciseness, and Naturalness.

Among them, Completeness requires that the information needed for the current decision explicitly exists in the context, avoiding the model relying on implicit assumptions or hallucinatory inferences. Conciseness requires removing irrelevant and redundant content, allowing attention to focus on the key signals for decision-making. Naturalness constrains the representation form, as excessive compression or artificial coding will make it more difficult for the model to parse.

They found that the main structural tension lies between the first two: including more potentially relevant information will improve Completeness but weaken Conciseness. Even if the context window is infinite, this tension will not disappear.

Based on this principle, they built GA on four components: a minimal atomic toolset, hierarchical on-demand memory, self-evolution, and context truncation and compression.

Figure | Overall architecture of GA. GA follows a unified Agent loop, constructs an execution context based on the current task and relevant memories, generates outputs or tool calls, and updates the system through structured feedback.

1. Minimal atomic toolset: Replace exhaustive listing with combination

GA's 9 atomic tools span 5 capability domains:

  • file_read, file_patch, and file_write handle reading, precise editing, and writing of files;
  • code_run executes Python or Bash in a controlled runtime;
  • web_scan and web_execute_js handle web-page inspection and browser operations;
  • update_working_checkpoint and start_long_term_update maintain the short-term context and long-term memory distillation;
  • ask_user handles human-in-the-loop decision-making.
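As a minimal sketch, the tool inventory above can be represented as a capability-domain registry. Only the tool and domain names come from the article; the registry shape itself is an assumption for illustration:

```python
# Hypothetical registry of GA's 9 atomic tools, grouped by the 5
# capability domains described in the article. The dict layout is an
# illustration, not GA's actual data structure.
ATOMIC_TOOLS = {
    "files": ["file_read", "file_patch", "file_write"],
    "code": ["code_run"],
    "web": ["web_scan", "web_execute_js"],
    "memory": ["update_working_checkpoint", "start_long_term_update"],
    "human": ["ask_user"],
}

def all_tools() -> list[str]:
    """Flatten the domain map into the full 9-tool action space."""
    return [t for tools in ATOMIC_TOOLS.values() for t in tools]
```

Keeping the whole action space this small is what lets the tool schemas fit comfortably inside a 30k-token context budget.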

In contrast, Claude Code exposes 53 tools at the source code level, and OpenClaw exposes 18 tool factories, with plugins injected at runtime.

The research team believes that an increase in tools will bring costs at two levels: at the Prompt level, each new tool expands the schema and description, squeezing the effective context budget; at the strategy level, the action space expands, and the ambiguity of tool selection increases, making planning more fragile.

The choice of tool minimization is based on two conditions: atomicity (each tool corresponds to an irreducible basic ability) and combinatorial generalization (complex behaviors are achieved through a sequence of atomic tools).

In theory, code_run alone could emulate the other 8 tools. The remaining tools exist not to expand capability but as shortcuts that reduce decision-making cost.

Meanwhile, each tool is further optimized internally: file_patch enforces a unique match on old_content and fails fast on zero or multiple matches; web_scan has a built-in layout-analysis algorithm that clones the DOM, computes each element's visibility, removes covered or hidden elements, and cuts token consumption by an order of magnitude relative to the raw DOM.
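The fail-fast matching rule of file_patch can be sketched as follows. The behavior is inferred from the description above; this is not GA's actual implementation:

```python
def file_patch(text: str, old_content: str, new_content: str) -> str:
    """Sketch of file_patch's unique-match rule: apply the edit only if
    old_content occurs exactly once; zero or multiple matches raise
    immediately instead of guessing which occurrence was meant."""
    n = text.count(old_content)
    if n == 0:
        raise ValueError("file_patch: old_content not found")
    if n > 1:
        raise ValueError(f"file_patch: old_content matched {n} times; must be unique")
    return text.replace(old_content, new_content, 1)
```

Failing fast on ambiguous matches turns a silent wrong edit into an immediate, recoverable error signal for the Agent.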

2. Hierarchical on-demand memory: Keep the L1 index bounded

GA divides memory into 3 functional forms, namely working memory, resident memory, and long-term memory, and defines four specific levels at the implementation layer: the L1 index layer, the L2 fact layer, the L3 SOP layer, and the L4 original session archive layer. Among them, L1 is the resident part, L2 and L3 together form the long-term memory, and L4 is responsible for persistence and traceability.

A key design is that L1 only records the "existence of a certain type of knowledge" and does not record its content. New entries are only added when a truly new category appears, and the overall description length approaches the Kolmogorov complexity corresponding to the category structure of the knowledge set. The LLM itself acts as a decoder. Once it infers the existence of a certain ability or fact, it uses tool calls to retrieve the complete content from deeper layers. This allows L2 and L3 to grow infinitely, while L1 always remains compact.

In addition, the research team introduced a meta-memory layer that defines the overall memory map, core rules, and update boundaries to prevent arbitrary writes, misreading of history, and cross-task leakage. The complete meta-SOP content is loaded on demand via file reading and is not preloaded by default. Long-term consolidation uses triggered commits rather than immediate writes: information first enters a verification stage, and only after being confirmed valid and reusable is it written into L2 or L3 in small increments, with the L1 index updated accordingly.

3. Self-evolution: From plain text SOP to executable code

GA separates the tool layer from the knowledge layer. The tool interface remains stable for all tasks, and all task-related capabilities are stored as SOP files and reusable scripts. The Agent can use its own tools to read, create, and modify these assets. This separation ensures that learning new tasks will not interfere with existing skills. In multi-round conversations, real feedback will gradually refine the SOP, and common subtasks will naturally evolve into stable and reusable scripts. Knowledge is upgraded from plain text instructions to executable code.

GA saves the original action trajectory in L4 but does not automatically promote it to L2 or L3. Reusable SOPs are only generated in the explicit integration step, triggered by milestone events, such as the achievement of sub-goals or system error recovery. The integration stage strictly follows the "No Execution, No Memory" rule, only retaining the content verified by successful tool execution. Guesses, temporary intermediate states, and failed decision branches will be systematically discarded.

To avoid falling into a loop of repeated errors, GA introduces a three-level failure escalation mechanism: first, make local minor repairs based on error reports; if the failure persists, abandon the current path, switch strategies, or search for missing information; after all automatic attempts fail, pause and request human intervention.
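The three-level escalation can be sketched as a simple control-flow ladder. The callables here are hypothetical stand-ins; only the three-level structure comes from the article:

```python
def run_with_escalation(step, repair, switch_strategy, ask_user):
    """Sketch of GA's three-level failure escalation:
      level 1 - retry with a local repair based on the error report;
      level 2 - abandon the path and switch strategy;
      level 3 - all automatic attempts failed, hand off to a human.
    Each argument is a zero-argument callable standing in for one level."""
    for attempt in (step, repair, switch_strategy):
        try:
            return attempt()
        except Exception:
            continue                 # escalate to the next level
    return ask_user()                # pause and request human intervention
```

The point of the ladder is that the Agent never loops on the same failing action: each failure strictly raises the level of intervention.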

4. Context truncation and compression: Keep the window within 30k

Many Agent frameworks rely on extended windows of 1M tokens, assuming that longer context means better inference. GA takes the opposite view: the "hallucination-free context length" of current models is roughly an order of magnitude smaller than the nominal value. GA therefore caps the budget at 30k tokens and focuses on compression rather than expansion, in four stages:

Stage 1 (Tool output truncation): Before each tool return value enters the message history, it is trimmed against a character threshold: 10,000 characters for code_run and 10,000 characters for web_scan in text mode. Anything longer keeps its first and last L/2 characters (L being the threshold), with the middle replaced by an ellipsis.
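Stage 1 is straightforward to sketch. The 10,000-character threshold comes from the article; the exact ellipsis marker is an assumption:

```python
def truncate_middle(s: str, limit: int = 10_000) -> str:
    """Stage-1 truncation sketch: keep the first and last limit//2
    characters of an oversized tool output and elide the middle,
    since the head and tail of tool output usually carry the signal."""
    if len(s) <= limit:
        return s
    half = limit // 2
    return s[:half] + "\n...[truncated]...\n" + s[-half:]
```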

Stage 2 (Tag-level compression): Triggered about every 5 rounds, repeated working memory blocks are replaced with placeholders; the content of the reasoning and tool tags is truncated to a window of about 800 characters at the beginning and end; the last 10 messages are exempt from compression, and about 80% of the rounds can hit the prompt cache.
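A minimal sketch of the Stage-2 clipping rule, treating each message as a plain string (the real system operates on tagged blocks; the message shape and marker here are assumptions):

```python
def compress_history(messages: list[str], window: int = 800,
                     exempt: int = 10) -> list[str]:
    """Stage-2 sketch: clip the body of older messages to a head/tail
    window of `window` characters each; the most recent `exempt`
    messages are left untouched so the prompt cache can still hit."""
    out = []
    for i, msg in enumerate(messages):
        if i >= len(messages) - exempt or len(msg) <= 2 * window:
            out.append(msg)                      # recent or already small
        else:
            out.append(msg[:window] + " ... " + msg[-window:])
    return out
```

Exempting the newest messages is what keeps the cacheable prefix stable across rounds.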

Stage 3 (Message eviction): When the total history length exceeds the character budget, first re-run the "Stage 2" compression according to stricter rules, and then delete the oldest messages according to the FIFO principle until the history scale drops below 60% of the total budget.
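Stage 3's FIFO eviction loop can be sketched as follows (the preliminary stricter Stage-2 pass is omitted for brevity; the 60% target comes from the article):

```python
def evict_fifo(messages: list[str], budget_chars: int,
               target_ratio: float = 0.6) -> list[str]:
    """Stage-3 sketch: if total history exceeds the character budget,
    delete the oldest messages first until the history falls below
    target_ratio (60%) of the budget."""
    history = list(messages)
    if sum(len(m) for m in history) <= budget_chars:
        return history                           # within budget, no-op
    while history and sum(len(m) for m in history) > budget_chars * target_ratio:
        history.pop(0)                           # evict oldest (FIFO)
    return history
```

Evicting down to 60% rather than exactly to the budget gives the loop headroom, so eviction does not re-trigger on every subsequent round.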

Stage 4 (Working memory anchor): After each tool call, a one-line summary of the last 20 rounds, the current round number, and the key_info block maintained by the Agent are automatically appended to the next user message. After the eviction in Stage 3, this anchor becomes the only source of long-term memory.
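A sketch of the Stage-4 anchor message; the field layout and formatting are assumptions, while the three ingredients (last-20-round summaries, round number, key_info block) come from the article:

```python
def build_anchor(round_no: int, summaries: list[str], key_info: str) -> str:
    """Stage-4 sketch: assemble the working-memory anchor appended to
    the next user message after each tool call. After Stage-3 eviction,
    this anchor is the only surviving record of earlier rounds."""
    lines = [f"[round {round_no}]"]
    lines += [f"- {s}" for s in summaries[-20:]]  # last 20 one-line summaries
    lines.append(f"key_info: {key_info}")
    return "\n".join(lines)
```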

GA's core code is about 3,300 lines, and the central Agent Loop is only 92 lines; while OpenClaw has about 530,000 lines of code, more than 160 times that of GA.

The Agent is exposed externally as a self-hosted CLI program. The command line is not an encapsulation layer of the internal platform but the native execution interface of the system. This extremely minimalist architecture naturally gives rise to multiple capabilities.

  • Subagent dispatching: The parent Agent directly starts multiple GA instances in the background by executing standard terminal commands. Each subprocess runs in an independent memory space with its own context isolation.
  • Reflect mode: A lightweight script periodically checks trigger conditions; when one fires, the returned string is sent to the GA CLI as a new task. The watchdog and scheduled tasks share the same mechanism, differing only in the trigger script.
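Because the CLI is the native execution interface, subagent dispatch is nothing more than spawning another process. A sketch, with the `ga run` command name and flags hypothetical (the article does not specify the CLI syntax):

```python
import subprocess

def dispatch_subagent(task: str, cli: tuple[str, ...] = ("ga", "run")):
    """Sketch of subagent dispatching: the parent Agent launches another
    GA instance as an ordinary background process. The 'ga run' command
    is a hypothetical name; each subprocess gets its own memory space
    and context isolation simply by being a separate OS process."""
    return subprocess.Popen([*cli, task])
```

The `cli` parameter is only there to make the sketch testable; the key point is that no framework-level orchestration layer is needed, just the operating system's process model.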

Furthermore, the combination of the two gives rise to the "autonomous exploration ability", where the dispatcher changes from the user to the Agent itself. GA maintains a persistent skill tree, scores candidate tasks in four dimensions: breadth, depth, utility, and innovation, and automatically adjusts the weights through a reflection mechanism based on actual usage.

II. How effective is it?

The research team systematically evaluated GA in 5 dimensions.

1. Task completion and token efficiency

The research team compared GA, Claude Code, OpenClaw, and Codex on SOP-Bench, Lifelong AgentBench, and RealFin-benchmark.

The results show that based on Claude Sonnet 4.6, GA achieved a 100% completion rate on the first two benchmarks, reaching or exceeding the SOTA baseline.

Among them, GA's input tokens on Lifelong AgentBench were only 222k, far lower than Claude Code's 800k and OpenClaw's 1.43M; GA achieved a 65% comprehensive accuracy rate on RealFin-benchmark, exceeding Claude Code (Opus 60%, Sonnet 55%), Codex (60%), and OpenClaw (35%).

Table | Task completion rate and token efficiency in the main Agent benchmarks and RealFin benchmark

2. Tool usage efficiency

Across 5 long-horizon complex tasks, both GA and Claude Code achieved a 100% success rate, but GA's total token consumption was only 35.1% of Claude Code's; the number of requests fell from 32.6 to 11.0, and tool calls from 22.6 to 12.8.

Table | Results of long-range complex tasks

3. Effectiveness of the memory system

The research team also conducted a memory ablation experiment on the SOP-Bench dangerous_goods subset. The results show that GA's task success rate was 13.87% under No-Memory, 52.44% under Full-Memory (memory scale of 575 tokens), and reached 66.48% under Condensed Memory with only 165 tokens, the same score as Redundant-Memory using 288 tokens.

Table | Memory ablation experiment of SOP-Bench dangerous_goods

In the LoCoMo long-term fact memory evaluation, GA achieved SOTA F1 and BLEU-1 scores in all four categories: Multi-Hop, Temporal, Open-Domain, and Single-Hop. The F1 on multi-hop reached 43.33, exceeding Mem0 (39.32) and A-MEM (29.03), and it does not rely on any embedding model or vector database.

Table | LoCoMo long-term fact memory evaluation

With the same 20 skills loaded and heavily used, after sending a single "Hello", GA's complete prompt was only 2,298 tokens, versus 22,821 for Claude Code, 23,932 for Codex, and 43,321 for OpenClaw.

4. Self-evolving ability

Taking the LangChain GitHub research task as the longitudinal tracking target, GA converged from 7m30s, 32 LLM calls, and 222203 tokens to 1m38s, 5 calls, and 23010 tokens within 9 rounds. The time decreased by 78.2%, the calls decreased by 84.4%, and the tokens decreased by 89.6%. From the 6th to the 9th round, it stabilized in the range of 23k ± 1k tokens. The input tokens decreased from 15581 to 1323, and