Can a 35B Agent Outperform Trillion-Parameter Models? Shanghai AI Lab Open-Sources Agents-A1 to Extend the Agent Horizon

You can be powerful without piling up specs.

Long-horizon tasks are among the most pressing challenges facing today's AI Agents.

In scenarios such as software engineering, scientific research, and complex decision-making, Agents often need to make continuous decisions over long horizons. Any single misstep can have a cascading effect on subsequent tasks. In the past, these capabilities often relied on larger models. Extending the Agent Horizon is another important avenue, but it has been hampered by inadequate infrastructure and the difficulty of unifying heterogeneous capabilities.

In response to these challenges, the Shanghai AI Lab team has introduced Agents-A1, a Mixture-of-Experts (MoE) Agent model with 35 billion parameters. Instead of simply increasing the parameter count, the team aims to match the long-horizon performance of trillion-parameter models by extending the Agent Horizon of a smaller model.

Paper Link: https://arxiv.org/abs/2606.30616

Research results show that Agents-A1 outperforms some trillion-parameter models in tasks such as multi-step search, scientific research, and long-instruction following. It also leads among models of the same 35B scale.

Figure: Benchmark performance of Agents-A1.

However, the research team also noted that Agents-A1 still lags behind cutting-edge large models in engineering tasks.

This research presents a more cost-effective approach to developing powerful AI Agents: teaching them to develop sustainable and proven work habits, rather than simply increasing their parameter scale.

How is Agents-A1 Designed?

Agents-A1 is a 35B-parameter MoE Agent model tailored for long-horizon tasks. Leveraging a long-horizon knowledge-action infrastructure, it integrates various Agent capabilities into a single model through a three-stage training process: Full-Domain Supervised Fine-Tuning (SFT), Domain-Specific Teacher Model Training, and Multi-Teacher On-Policy Distillation (OPD). The specific process is as follows:

1. Full-Domain Supervised Fine-Tuning (SFT)

This stage aims to establish the model's general Agent capabilities. The research team trained the model using high-quality long-horizon trajectory data from multiple domains and tasks, enhancing its understanding, reasoning, and instruction-following abilities in long-context scenarios. During training, sample packing was employed to concatenate multiple shorter samples into a single training sequence, and attention masks were used to prevent interference between samples, reducing padding overhead and improving GPU utilization.
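
The packing step described above can be sketched as follows. This is a minimal illustration, not the team's implementation: the function names `pack_samples` and `block_causal_mask`, the greedy packing order, and the toy token ids are all assumptions. It shows how several short samples share one sequence while the attention mask keeps each token from attending to tokens of a different sample.

```python
from typing import List, Tuple


def pack_samples(samples: List[List[int]], max_len: int) -> List[Tuple[List[int], List[int]]]:
    """Greedily pack token-id lists into sequences of at most max_len tokens.

    Returns (tokens, segment_ids) pairs. segment_ids record which sample
    each token came from within its packed sequence.
    (Hypothetical helper; the article does not describe the packing order.)
    """
    packed = []
    tokens: List[int] = []
    segments: List[int] = []
    for sample in samples:
        if len(sample) > max_len:
            raise ValueError("sample is longer than max_len")
        # Start a new packed sequence when the next sample would not fit.
        if len(tokens) + len(sample) > max_len:
            packed.append((tokens, segments))
            tokens, segments = [], []
        seg_id = segments[-1] + 1 if segments else 0
        tokens.extend(sample)
        segments.extend([seg_id] * len(sample))
    if tokens:
        packed.append((tokens, segments))
    return packed


def block_causal_mask(segment_ids: List[int]) -> List[List[bool]]:
    """mask[i][j] is True when token i may attend to token j:
    both tokens belong to the same sample and j is not later than i."""
    n = len(segment_ids)
    return [
        [segment_ids[i] == segment_ids[j] and j <= i for j in range(n)]
        for i in range(n)
    ]


samples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
packed = pack_samples(samples, max_len=6)
tokens, segment_ids = packed[0]
mask = block_causal_mask(segment_ids)
```

Here the first two samples share one sequence of five tokens instead of two padded sequences of length six, and the mask is block-diagonal, so token 4 cannot see tokens 1 to 3.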

2. Domain-Specific Teacher Model Training

The research team decomposed the model's capabilities into four types of specialized teachers: Search, Scientific Reasoning, Instruction Following, and Tool Usage, and designed separate training programs for each.

Search Teacher: Trained in two stages, SFT followed by RL with GRPO (Group Relative Policy Optimization), to strengthen its ability to decompose complex problems, run multi-hop searches, and coordinate tools. The goal was to cut redundant searches while maintaining accuracy.
Scientific Teacher: Trained with two-stage SFT. The first stage strengthened scientific derivation; the second used tool-augmented trajectories to improve external interaction and evidence integration, so that the model learned when to use external tools and how to integrate retrieved or computed evidence.
Instruction Following Teacher: Trained with two-stage RL using GRPO. The first stage improved compliance with fine-grained constraints such as format, length, keywords, and language; the second strengthened the ability to locate evidence, integrate information, and follow contextual rules in long-context in-context learning (ICL).
Tool Usage Teacher: Trained with Tool SFT followed by Tool RL, focusing on when to call tools, how to recover from errors, and when to end a task. Training combined outcome rewards, process rewards, and the reuse of high-quality difficult tasks to improve tool usage.
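
Two of the teachers above are trained with GRPO, whose core idea is to score each sampled response relative to the other responses drawn for the same prompt. The sketch below shows only that group-relative advantage step. The reward function `search_reward`, its `free_searches` and `penalty` parameters, and the rollout data are hypothetical, chosen to mirror the stated goal of reducing redundant searches while maintaining accuracy; the article does not give the actual reward design.

```python
from statistics import mean, pstdev
from typing import List


def search_reward(correct: bool, num_searches: int,
                  free_searches: int = 4, penalty: float = 0.05) -> float:
    """Hypothetical reward: 1 for a correct answer, minus a small penalty
    for every search call beyond a free budget."""
    base = 1.0 if correct else 0.0
    return base - penalty * max(0, num_searches - free_searches)


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of rollouts for the same prompt:
    advantage_i = (r_i - mean(r)) / (std(r) + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of four rollouts for the same question: (answer correct?, search calls).
rollouts = [(True, 3), (True, 8), (False, 2), (False, 9)]
rewards = [search_reward(ok, n) for ok, n in rollouts]
advantages = group_relative_advantages(rewards)
```

Under these assumptions, the correct rollout that used three searches receives a larger advantage than the correct rollout that used eight, so the policy update favors answers that are both accurate and economical.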

3. Unified Model Stage: Multi-Teacher On-Policy Distillation (OPD)

The research team first collected trajectories from the student model and then had the corresponding domain teacher score them and provide guidance. Unlike offline imitation, the teachers directly evaluated trajectories that the student itself had generated. Finally, through domain-routed distillation and significant vocabulary alignment, the model balanced the broad capabilities of Full-Domain SFT with the specialties of each domain teacher.
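
A toy sketch may help make the on-policy, domain-routed idea concrete. It assumes a per-step reverse KL divergence between the student's and the teacher's next-token distributions, which is one common way to formulate on-policy distillation; the article does not state the actual loss, and the vocabulary alignment step is not modeled here. The names `reverse_kl`, `opd_loss`, and the toy teachers are hypothetical.

```python
import math
from typing import Callable, Dict, List

Distribution = Dict[str, float]  # token -> probability


def reverse_kl(student: Distribution, teacher: Distribution, floor: float = 1e-9) -> float:
    """KL(student || teacher), summed over the student's support.
    Teacher probabilities are floored to avoid division by zero."""
    return sum(
        p * math.log(p / max(teacher.get(tok, 0.0), floor))
        for tok, p in student.items()
        if p > 0.0
    )


def opd_loss(domain: str,
             student_steps: List[Distribution],
             teachers: Dict[str, Callable[[int], Distribution]]) -> float:
    """Average per-step divergence on a trajectory the student generated,
    scored by the teacher that matches the trajectory's domain."""
    teacher = teachers[domain]  # domain routing: pick the matching teacher
    per_step = [reverse_kl(s, teacher(t)) for t, s in enumerate(student_steps)]
    return sum(per_step) / len(per_step)


# Toy teachers that return a fixed distribution at every step (assumption).
teachers = {
    "search": lambda step: {"search": 0.9, "answer": 0.1},
    "tool": lambda step: {"call_tool": 0.8, "answer": 0.2},
}

# Student distributions at two steps of its own search trajectory.
student_steps = [
    {"search": 0.5, "answer": 0.5},
    {"search": 0.9, "answer": 0.1},
]
loss = opd_loss("search", student_steps, teachers)
```

The key difference from offline imitation is where the states come from: the teacher is queried on steps the student actually visited, so the student is corrected on its own mistakes rather than on the teacher's trajectories.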

Figure: Overview of the three-stage training process of Agents-A1.

To support this training process, the research team built a knowledge-action infrastructure centered around the Knowledge-Action Graph (KAG) and continuously expanded high-quality long-trajectory data through self-play. In this way, the training samples not only contain questions and answers but also fully preserve the tool usage and verification processes.

Figure: Overview of the knowledge-action infrastructure of Agents-A1.

Experimental Results

Overall, Agents-A1 excels in long-horizon search, instruction following, and scientific reasoning tasks. It not only outperforms other 35B models but also surpasses some trillion-parameter models on certain benchmarks. The specific results are as follows:

Figure: Performance comparison between Qwen3.5-35B-A3B, Agents-A1-SFT, and Agents-A1.

1. Full-Domain SFT

The results show that Agents-A1-SFT significantly improved in long-horizon search, engineering tasks, and scientific research. However, it regressed in general Agent tasks, instruction following, and HLE. This indicates that Full-Domain SFT alone is insufficient to mitigate conflicts between different reasoning modes.

2. Domain-Specific Teacher Model Training

Search-Enhanced Teacher: Consistently outperformed Qwen3.5-35B-A3B on four benchmarks. The most significant improvement was observed on the General AI Assistant Benchmark (GAIA), where the score increased from 59.8 to 95.1.

Figure: Performance comparison between Qwen3.5-35B-A3B and the Search-Enhanced Teacher model.

Scientific-Enhanced Teacher: Two-stage SFT significantly enhanced the teacher model's scientific reasoning and tool interaction abilities. Compared to the baseline model, the Scientific-Enhanced Teacher performed better overall on various scientific tasks, especially on FS-R, where the score increased from 2.5 to 54.3.

Figure: Performance comparison between Qwen3.5-35B-A3B and the Scientific-Enhanced Teacher model.

Instruction Following and Long-Context Learning Experiments: Reinforcement learning significantly improved the model's long-context understanding, its instruction following, and its ability to generalize to verifiable instruction constraints. Overall, the RL-Enhanced Teacher outperformed Qwen3.5-35B-A3B in the relevant evaluations, with particularly large gains on LongBench V2 and IFBench.
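
"Verifiable instruction constraints" are constraints that a program can check without a judge model, such as length limits, required keywords, or output format. The sketch below shows how such checks could be turned into an RL reward. The checker names, the fraction-satisfied reward, and the example texts are assumptions for illustration, not the team's reward function.

```python
from typing import Callable, List

Check = Callable[[str], bool]


def max_words(limit: int) -> Check:
    """Length constraint: at most `limit` whitespace-separated words."""
    return lambda text: len(text.split()) <= limit


def contains_keywords(keywords: List[str]) -> Check:
    """Keyword constraint: every keyword appears, case-insensitively."""
    return lambda text: all(k.lower() in text.lower() for k in keywords)


def is_bulleted() -> Check:
    """Format constraint: every non-empty line starts with '- '."""
    def check(text: str) -> bool:
        lines = [ln for ln in text.splitlines() if ln.strip()]
        return bool(lines) and all(ln.lstrip().startswith("- ") for ln in lines)
    return check


def constraint_reward(response: str, checks: List[Check]) -> float:
    """Fraction of constraints the response satisfies, in [0, 1]."""
    if not checks:
        return 0.0
    return sum(c(response) for c in checks) / len(checks)


checks = [max_words(12), contains_keywords(["agent", "horizon"]), is_bulleted()]
good = "- The agent plans ahead\n- The horizon is long"
partial = "- The agent acts"
bad = "The agent plans ahead over many, many steps without any list formatting at all."

reward_good = constraint_reward(good, checks)
reward_partial = constraint_reward(partial, checks)
reward_bad = constraint_reward(bad, checks)
```

Because each check is deterministic, the reward can be computed for every sampled response during RL, which is what makes this kind of constraint practical as a training signal.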

Figure: Evaluation results of Qwen3.5-35B-A3B and the RL-Enhanced Teacher model on LongBench V2, IFBench, and IFEval.

Tool Usage Experiments: Explicit tool usage supervision and reinforcement learning significantly improved the model's tool usage ability, especially in tasks requiring multi-turn, structured interactions. Specifically, the tool-enhanced model achieved significant improvements on τ²-Bench and VitaBench.

Figure: Performance evaluation results of Qwen3.5-35B-A3B and the tool-enhanced RL Teacher model on τ²-Bench and VitaBench.

3. Unified Model

The results show that Multi-Teacher OPD mitigates conflicts between task-specific reasoning modes more effectively than Full-Domain SFT alone. It integrates domain-specific expertise better while maintaining broad capability coverage, further improving long-horizon task performance.

Figure: Comparison between Agents-A1 and 35B/1T-scale models.

In addition to standard benchmarks, the research team demonstrated Agents-A1's long-horizon Agent capabilities through two case studies. In the whale call detection task, Agents-A1 was able to continuously optimize the entire machine learning process over an extended period. In a 12-hour run, starting from a simple CNN baseline, the model improved the validation set AUC from 0.58 to 0.9935. This indicates that Agents-A1 has moved beyond local parameter tuning and can continuously improve solutions and enhance generalization ability through multiple iterations.
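
For readers unfamiliar with the metric, AUC (area under the ROC curve) can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, so 0.5 corresponds to chance and 1.0 to perfect ranking. The sketch below computes it directly from that definition. The labels and scores are made-up toy data, not results from the whale call task.

```python
from typing import List


def roc_auc(labels: List[int], scores: List[float]) -> float:
    """ROC AUC as the fraction of (positive, negative) pairs in which the
    positive example is scored higher; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy validation set: 1 = whale call present, 0 = absent (illustrative only).
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.8, 0.7, 0.3, 0.6]
auc = roc_auc(labels, scores)
```

This pairwise form is quadratic in the number of examples and is meant only to make the definition concrete; rank-based implementations are used for large validation sets.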

Figure: Optimization trajectory of Agents-A1 in a 12-hour run on the ICML 2013 Whale Challenge.

Agents-A1 also demonstrated end-to-end analysis capabilities in earth science tasks. Taking the 2008 Tropical Cyclone Nargis as an example, the model automatically identified data sources, extracted and cleaned the data, calculated derived indicators, visualized the results, and synthesized a report, forming a multi-stage closed loop from planning to report generation. It also reconstructed the storm's evolution with high fidelity.

Figure: Path of the 2008 Tropical Cyclone (Nargis) generated by Agents-A1.

Limitations and Future Directions

Despite its strong performance in many long-horizon tasks, Agents-A1 still has some limitations:

First, the model has room for improvement in fundamental atomic capabilities such as "plan before reasoning", "reflect before acting", summarizing key information in long contexts, and identifying important historical information. These capabilities directly affect stability, goal consistency, and execution efficiency in long-horizon tasks, and they need to be strengthened to further improve Agents-A1's long-horizon problem-solving ability.

Second, in machine learning engineering tasks, there is still a significant gap between Agents-A1 and larger models. In the future, enhancing the model's goal consistency, decision memory, and experiment efficiency in the complete

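The views expressed in this article are solely those of the author. The 36Kr platform provides information storage services only.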
