The era of AI "agent organizations" has begun, and Microsoft has proposed a method for asynchronous thinking called AsyncThink.
The transition from large language models (LLMs) to agents represents a paradigm shift in artificial intelligence (AI) systems from "talking" to "doing."
Furthermore, when multiple agents work collaboratively and in parallel as an organization, producing results that surpass any individual's intelligence, the next paradigm of AI, the "agentic organization," emerges.
However, although current LLMs demonstrate impressive reasoning abilities as individual agents, truly realizing the vision of an "agentic organization" requires that LLMs not only think independently but also think collaboratively as an organized system.
To address this, the Microsoft team proposed a new LLM reasoning method called "AsyncThink," which organizes the internal thinking process into a structure that can be executed concurrently, addressing the high latency, poor adaptability, and rigidity of existing parallel-thinking methods.
Experiments show that, compared to parallel thinking, AsyncThink improves the accuracy of mathematical reasoning while reducing reasoning latency by 28%. Moreover, AsyncThink generalizes its learned asynchronous-thinking ability to unseen tasks without additional training.
Paper link: https://arxiv.org/pdf/2510.26658
Research Method
The core of AsyncThink is the "Organizer-Worker" thinking protocol, in which a single LLM plays two roles:
On the one hand, it acts as an "organizer," breaking complex problems into subtasks and scheduling them through "Fork" and "Join" operations. On the other hand, it serves as a "worker," executing these subtasks and returning intermediate results.
Figure | An example of AsyncThink's thinking protocol, which enables asynchronous thinking by controlling the thinking trajectory through Fork-Join operations.
In this way, the model can not only process multiple sub-problems in parallel but also dynamically adjust its thinking, achieving more flexible and efficient reasoning.
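The Fork-Join control flow described above can be sketched in plain Python. Here `organizer` and `worker` are hypothetical stand-ins for LLM calls, and the protocol's actual tag syntax is abstracted away; this is a minimal sketch, not the paper's implementation.

```python
import asyncio

async def worker(subtask: str) -> str:
    # Stand-in for an LLM call that solves one independent sub-problem.
    await asyncio.sleep(0.01)  # simulate generation latency
    return f"result({subtask})"

async def organizer(problem: str) -> str:
    # Fork: spawn workers for sub-problems that can be solved independently.
    subtasks = [f"{problem}/part{i}" for i in range(3)]
    forks = [asyncio.create_task(worker(t)) for t in subtasks]
    # The organizer can continue its own reasoning while workers run.
    # Join: collect intermediate results and synthesize the final answer.
    results = await asyncio.gather(*forks)
    return " + ".join(results)

print(asyncio.run(organizer("P")))
```

Because the three workers run concurrently, end-to-end latency is close to one worker's latency rather than the sum of all three, which is the source of AsyncThink's speedup.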
To train the AsyncThink model, they proposed a two-stage training process: cold-start format fine-tuning and reinforcement learning.
1. Cold-start Format Fine-tuning
In this stage, the existing LLM undergoes cold-start format fine-tuning to master the organizational syntax and action structure of the AsyncThink framework.
In the data synthesis phase, since there are almost no "organizer-worker" thinking samples in existing corpora, the research team used GPT-4o to generate synthetic training data. GPT-4o first analyzes each input problem, identifies the thinking segments that can be solved independently, and then generates the reasoning trajectories of the organizer and the workers in the AsyncThink protocol format.
In the structure initialization phase, to enhance structural flexibility, the research team randomly samples different organizational action sequences and embeds one sampled structure into each training prompt, allowing the model to learn under various structures and generate more diverse thinking topologies.
After data synthesis and structure initialization, the research team performs supervised fine-tuning on the base LLM, endowing the model with the ability to issue valid organizer actions.
At this stage, the model has not yet learned to produce correct answers through asynchronous thinking; it only imitates the format.
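A synthetic training example of this kind might look like the following sketch. The tag names (`<fork>`, `<join>`, `<answer>`) are illustrative placeholders, not the paper's exact syntax, and the format check mirrors the idea that a trajectory is only "executable" if its structure is valid.

```python
import re

# A hypothetical AsyncThink-style trace; tag names are illustrative,
# not the exact protocol syntax from the paper.
trace = (
    "<think>The problem splits into two independent parts.</think>"
    "<fork id=1>Check candidates below 50.</fork>"
    "<fork id=2>Check candidates 50 and above.</fork>"
    "<join id=1/><join id=2/>"
    "<answer>42</answer>"
)

def is_executable(trace: str) -> bool:
    """Format check: every fork must be joined, and an answer must exist."""
    forks = set(re.findall(r"<fork id=(\d+)>", trace))
    joins = set(re.findall(r"<join id=(\d+)/>", trace))
    return forks == joins and "<answer>" in trace

print(is_executable(trace))  # True
```

Supervised fine-tuning on traces like this teaches the model the syntax of forking and joining, which is exactly what the text above means by "imitating the format" without yet learning when forking helps.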
2. Reinforcement Learning
Since the first stage only teaches the syntactic structure of organizer actions, the model still lacks the ability to use this thinking mechanism to produce final answers. The research team therefore conducted the second stage, reinforcement learning, guiding the model toward efficient and accurate strategies through rewards.
Figure | Schematic diagram of the AsyncThink reinforcement learning framework.
In the reward design, the accuracy reward ensures the final answer is correct; the format reward ensures the generated trajectory is executable; and the thinking-concurrency reward encourages the model to look for opportunities to think asynchronously rather than sequentially.
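One plausible way to combine the three reward terms is sketched below, under the assumptions that a non-executable trajectory forfeits all reward and that concurrency enters as a weighted bonus; the paper's exact weighting and definitions may differ.

```python
def reward(answer_correct: bool, format_valid: bool, concurrency: float,
           eta: float = 0.1) -> float:
    """Hypothetical combination of the three reward terms.

    concurrency: assumed fraction of thinking that ran in parallel, in [0, 1].
    eta: assumed weight of the concurrency bonus (not from the paper).
    """
    if not format_valid:
        return 0.0  # non-executable trajectories earn nothing
    correctness = 1.0 if answer_correct else 0.0
    return correctness + eta * concurrency

print(reward(True, True, 0.6))
```

Gating everything on format validity keeps the model from gaming the concurrency bonus with malformed traces, while the small `eta` keeps correctness dominant over speed.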
During training, the research team adapted the Group Relative Policy Optimization (GRPO) algorithm to the asynchronous structure. Instead of generating a single chain of thought (CoT), the model generates a "thinking structure" composed of an organizer and multiple workers. The final reward is shared across all outputs of the structure, ensuring that every part is optimized toward the same goal.
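The group-relative advantage computation at the heart of GRPO can be sketched as follows; broadcasting one shared reward per sampled thinking structure to all of its member outputs is the adaptation described above. This is a simplified sketch of the advantage step only, not a full RL loop.

```python
import statistics

def grpo_advantages(group_rewards: list[float]) -> list[float]:
    # GRPO scores each sample against its group: advantage is the reward
    # normalized by the group's mean and standard deviation. Here each
    # element is one thinking structure's single shared reward, which is
    # then applied to every output that structure contains (organizer
    # plus all workers), so all parts optimize toward the same goal.
    mu = statistics.mean(group_rewards)
    sigma = statistics.pstdev(group_rewards) or 1.0  # avoid divide-by-zero
    return [(r - mu) / sigma for r in group_rewards]

# Four sampled structures for the same prompt, one shared reward each:
adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])
print(adv)  # [1.0, -1.0, 1.0, -1.0]
```

Because the advantage is relative within the group, structures that found the answer faster or more often are pushed up and the rest pushed down, with no separate value network needed.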
Through a refined reward model and optimization mechanism, the AsyncThink model can dynamically and efficiently coordinate its internal "agentic organization" to solve practical problems.
Experimental Evaluation
The research team evaluated the performance of the AsyncThink model on multi-solution countdown, mathematical reasoning, and Sudoku tasks. Experiments show that, compared with sequential-thinking and parallel-thinking models, AsyncThink consistently achieves higher accuracy while reducing latency.
In addition, the research team analyzed its performance through ablation studies, highlighting the effectiveness of AsyncThink's two-stage training process.
Details are as follows:
1. Multi-solution Countdown Experiment
AsyncThink's fully-correct rate reaches 89.0%, higher than that of parallel thinking (68.6%) and sequential thinking (70.5%). This means it is not only more accurate but also covers more solutions.
Figure | Evaluation results on the multi-solution countdown task. "≥n Correct" indicates whether the model successfully finds at least n correct solutions to a given problem.
2. Mathematical Reasoning Experiment
On AIME-24, AsyncThink reaches 38.7% accuracy with a latency of 1468.0; on AMC-23, it reaches 73.3% accuracy with a latency of 1459.5. Compared with traditional parallel reasoning, it reduces reasoning latency by about 28% while maintaining accuracy.
Figure | Evaluation results of mathematical reasoning on AIME-24 and AMC-23.
3. Cross-task Generalization Experiment
Although trained only on the countdown task, AsyncThink still performs best when transferred directly to 4×4 Sudoku (89.4% accuracy and the lowest latency). This shows that the LLM has learned not a task-specific pattern but a transferable organizational thinking pattern.
Figure | Evaluation results of AsyncThink on the 4×4 Sudoku task.
4. Ablation Experiment
In the ablation experiment, the research team found that format fine-tuning (Format SFT) teaches the LLM the "language": how to Fork and Join. Reinforcement learning (RL) teaches it the "strategy": when to Fork and how to Join to be faster and more accurate. The concurrency reward (Rη Reward) teaches it "efficiency": balancing accuracy against latency.
Figure | Results of the ablation experiment by removing the key components of AsyncThink.
Future Work
Although AsyncThink shows significant advantages in improving LLM reasoning accuracy and reducing reasoning latency, it is only a starting point for realizing the vision of "agentic organization."
In future work, the research team will continue to explore the "agentic organization" in three directions: scale and diversity expansion, recursive agentic organization, and human-AI agentic organization.
1. Expand the Scale and Diversity of Agents
First, expand the number of workers. Future work should explore the scaling laws of asynchronous thinking: how the accuracy-latency trade-off evolves as the agent pool grows from a handful of workers to hundreds or even thousands.
Second, expand the diversity of agents: move beyond a homogeneous agent pool to a large organization of heterogeneous expert workers. These agents can be fine-tuned for specific domains (such as mathematics, coding, or data analysis) and, importantly, equipped with different external tools (such as code interpreters, database query engines, or web search APIs). This poses a more complex and powerful learning problem for the organizer.
2. Recursive Agentic Organization
In this paradigm, any worker can be dynamically promoted to a sub-organizer, gaining the ability to Fork its own team of sub-workers. This yields a flexible hierarchical structure naturally suited to deeply nested, complex problems that require multi-level decomposition. For example, an organizer may delegate a broad query such as "solve * problem," and the designated worker then acts as a sub-organizer, Forking three new sub-workers to independently test different lemmas in parallel.
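Such recursive delegation could be sketched as a worker that forks its own sub-workers whenever its task still decomposes. The decomposition rule and the `proof`/`combine` labels here are placeholders for LLM decisions, not anything specified by the paper.

```python
import asyncio

async def solve(task: str, depth: int) -> str:
    # Any worker can act as a sub-organizer: if its task still decomposes
    # (here modeled by a simple depth counter standing in for an LLM
    # decision), it forks its own sub-workers; otherwise it solves the
    # task directly.
    if depth == 0:
        return f"proof({task})"
    subtasks = [f"{task}.lemma{i}" for i in range(3)]
    # Fork sub-workers recursively and join their results.
    parts = await asyncio.gather(*(solve(t, depth - 1) for t in subtasks))
    return f"combine({', '.join(parts)})"

print(asyncio.run(solve("theorem", 2)))
```

At every level the siblings run concurrently, so the hierarchy's latency grows with its depth rather than with the total number of sub-workers.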
3. Human - AI Agentic Organization
A key frontier is to create a human-AI collaborative framework by directly integrating humans into the agentic organization. This may involve "humans as organizers," using the Fork protocol to assign tasks to AI workers, or "humans as workers," with the AI Forking out tasks that require human judgment. In addition, collaborative planning would allow humans and AI to jointly design asynchronous strategies before execution. This direction goes beyond pure AI autonomy toward powerful hybrid intelligence.
This article is from the WeChat public account "Academic Headlines" (ID: SciTouTiao). Compiled by Xiaoxiao. Republished by 36Kr with authorization.