Strategy rewrites the "history of World War I": the Chinese Academy of Sciences open-sources DipLLM, a brand-new game-playing agent framework.
[Introduction] The Institute of Automation of the Chinese Academy of Sciences has proposed DipLLM, the first fine-tuned large language model agent framework for the complex strategy game Diplomacy. Using only 1.5% of Cicero's training data, it surpasses Cicero in performance, demonstrating excellent strategic capability and sample efficiency. The framework transforms the complex decision-making task into serialized subtasks through autoregressive decomposition and efficiently fine-tunes the LLM toward a theoretically grounded equilibrium strategy objective, providing a new paradigm for building more general and efficient game agents.
Go and Texas Hold'em were once the testing grounds for the rise of AI. From AlphaGo to Libratus, artificial intelligence has continuously raised the strategic ceiling.
However, the next battlefield is even more challenging: Diplomacy, a seven-player game that combines cooperation and competition, with a single-round action space as large as 10^64. Its strategic modeling complexity is unprecedented!
To tackle this, Meta launched the agent Cicero [Meta, Science 2022], which combined human data with strategy search to achieve a breakthrough in this field. However, its method depends heavily on ultra-large-scale equilibrium search and resource-intensive training, making it difficult to scale and transfer.
Now, a research result from the Institute of Automation of the Chinese Academy of Sciences has been accepted at ICML 2025, proposing a new-paradigm game agent framework, DipLLM. For the first time, it explores a strategy learning method based on fine-tuning large language models in Diplomacy, significantly reducing resource requirements while demonstrating excellent strategic capability and sample efficiency.
DipLLM is built on an autoregressive decomposition framework that transforms the high-dimensional joint action modeling task into serialized subtasks and efficiently fine-tunes the LLM toward a theoretically grounded equilibrium strategy objective.
Using only 1.5% of Cicero's training data, it surpasses Cicero in performance, showing strong strategic capability and remarkable sample efficiency.
Paper link: https://arxiv.org/pdf/2506.09655
Open-source code: https://github.com/KaiXIIM/dipllm
The first author of the paper is Xu Kaixuan, a second-year direct-entry PhD student at the Institute of Automation, Chinese Academy of Sciences; the co-first author is Chai Jiajun, a fifth-year direct-entry PhD student at the same institute; the corresponding author is Zhu Yuanheng, an associate researcher at the institute. Their research directions include reinforcement-learning post-training of large models and agents, multi-agent reinforcement learning, and multi-embodied intelligence.
Research Background
Although classic game tasks such as Go and chess have been widely studied, their action spaces are generally on the order of thousands. In Diplomacy, a player must make decisions for multiple units simultaneously, and the number of joint action combinations per round reaches 10^64, sharply increasing the difficulty of strategy learning and modeling.
Currently, most mainstream methods rely on equilibrium search to generate large-scale game data for strategy fitting.
For example, Cicero used 448 GPUs in parallel to generate data during the training phase, which is costly and difficult to scale.
In recent years, large language models (LLMs) have demonstrated strong generalization and reasoning abilities, opening new possibilities for complex decision-making tasks. Although prompt-based methods can be quickly adapted to some tasks, in complex games such as Diplomacy their strategy generation ability remains limited by the capability of the base model.
Existing research has shown that fine-tuning LLMs can significantly improve strategic performance [Zhai et al., NeurIPS 2024].
However, in complex games there are still many challenges in constructing a reasonable training framework and optimization objective, especially the decision-making obstacles caused by the ultra-large action space and the lack of well-defined equilibrium strategies in complex multi-agent games.
DipLLM: An Autoregressive Strategy Decomposition Agent for Complex Games
To solve the above problems, the researchers have proposed an LLM agent suited to complex game environments, constructed in three key steps.
Step 1: An Autoregressive Decomposition Framework Based on Large Language Models
In the Diplomacy game, players need to select actions for up to 34 units simultaneously, with each unit having about 26 choices, resulting in an exponentially growing joint action space.
To address this, the researchers have proposed an autoregressive factorization framework based on large language models, which breaks the complex joint decision-making task down into a series of ordered unit-action selection subtasks.
Specifically, the player's overall strategy is represented as

$$\pi_i(a_i \mid s) = \prod_{d=1}^{D} \pi_i^d\left(a_i^d \mid s, a_i^{1:d-1}\right),$$

where each sub-strategy $\pi_i^d$ depends on the current game state $s$ and the actions of the previous $d-1$ units, so the action of the current unit is generated in sequence.
This form naturally aligns with the "next-token prediction" mechanism that LLMs excel at, enabling the model to output each unit's action decision step by step.
During the inference phase, the LLM first converts the raw game state into a text representation $s$; then, for each unit, it combines the unit index with the actions of the previous units $a_i^{1:d-1}$ to construct a prompt and generate that unit's action $a_i^d$; the per-unit actions are finally concatenated into the complete joint action $a_i$.
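As a rough illustration only, the following Python sketch mimics this autoregressive inference loop; the prompt wording, the helper names (`build_prompt`, `select_joint_action`), and the `generate` callable are assumptions for exposition, not the released DipLLM implementation.

```python
from typing import Callable, List

def build_prompt(state_text: str, unit: str, prev_actions: List[str]) -> str:
    """Combine the textual game state, the current unit, and the orders already
    chosen this turn into a single prompt (the format here is illustrative)."""
    prev = "; ".join(prev_actions) if prev_actions else "none"
    return (
        f"Game state:\n{state_text}\n"
        f"Orders already chosen this turn: {prev}\n"
        f"Give one order for unit {unit}:"
    )

def select_joint_action(
    generate: Callable[[str], str],  # any LLM text-generation call: prompt -> order
    state_text: str,
    units: List[str],
) -> List[str]:
    """Autoregressive decomposition at inference time: choose one unit's order
    at a time, conditioning each choice on the state and all previously chosen
    orders, then concatenate the per-unit orders into the joint action."""
    joint_action: List[str] = []
    for unit in units:
        prompt = build_prompt(state_text, unit, joint_action)
        joint_action.append(generate(prompt))
    return joint_action

# Example with a dummy generator that always answers "HLD" (hold):
# select_joint_action(lambda p: "HLD", "Spring 1901, France: ...", ["A PAR", "F BRE"])
```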
Step 2: Strategy Learning Objectives under the Autoregressive Decomposition Framework
To effectively guide the fine-tuning process, the researchers redefined the strategy learning objective under the autoregressive decomposition framework so as to learn an approximate Nash equilibrium strategy.
In traditional methods, such as piKL-Hedge [Jacob et al., ICML 2022], the player's strategy is usually modeled as a centralized decision: the strategy of player $i$ is guided jointly by the joint action value function $Q_i(s, a_i)$ and the anchor strategy $\tau_i$,

$$\pi_i(a_i \mid s) \propto \tau_i(a_i \mid s)\,\exp\left(\frac{Q_i(s, a_i)}{\lambda}\right),$$

where the anchor strategy $\tau_i$ is a human-like strategy obtained through imitation learning on human data, which keeps the search from drifting outside the range of play that humans can understand, and $\lambda$ controls how strongly the strategy stays close to the anchor.
To define the strategy learning objective under the decomposition, the researchers decomposed the joint action value $Q_i(s, a_i)$ into a series of unit-level sub-action values $Q_i^d(s, a_i^{1:d})$, where $Q_i^d$ represents the decomposed action value of the $d$-th unit. Based on this decomposition, the following unit-level strategy learning objective is defined:

$$\pi_i^d\left(a_i^d \mid s, a_i^{1:d-1}\right) \propto \tau_i^d\left(a_i^d \mid s, a_i^{1:d-1}\right)\,\exp\left(\frac{Q_i^d(s, a_i^{1:d})}{\lambda}\right).$$
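For intuition, here is a small numerical sketch of how such a unit-level target distribution could be computed from decomposed action values and anchor probabilities; the variable names, array layout, and the exact normalization are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def unit_level_target_policy(q_d: np.ndarray, anchor_d: np.ndarray, lam: float) -> np.ndarray:
    """Target distribution proportional to tau^d(a) * exp(Q^d(a) / lambda),
    normalized over the candidate orders of the current unit."""
    logits = np.log(anchor_d + 1e-12) + q_d / lam
    logits -= logits.max()            # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Four candidate orders for one unit: decomposed values and anchor probabilities.
q_d = np.array([0.2, 0.5, -0.1, 0.0])
anchor_d = np.array([0.4, 0.3, 0.2, 0.1])
print(unit_level_target_policy(q_d, anchor_d, lam=0.5))
# Smaller lambda leans on the decomposed values; larger lambda stays near the anchor.
```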
Theoretical Guarantee
The researchers further analyzed the properties of this strategy learning objective in the game setting from a theoretical perspective and established two key theorems to support it:
- Theorem 1 (Strategy Equivalence): The joint strategy derived from the autoregressive decomposition learning objective, $\prod_{d=1}^{D}\pi_i^d\left(a_i^d \mid s, a_i^{1:d-1}\right)$, is equivalent to the original strategy distribution $\pi$, achieving more efficient modeling without any loss of strategy expressiveness.
- Theorem 2 (Convergence to Approximate Nash Equilibrium): In a two-player zero-sum game, if both players use the autoregressive decomposition strategy learning objective to iteratively update their strategies for $T$ rounds, their average strategies converge to an approximate Nash equilibrium.
Step 3: Fine-Tuning the Large Language Model to Approximate the Equilibrium Strategy Objective
To guide the model's strategy toward the equilibrium objective, the researchers constructed a data generation and fine-tuning pipeline that combines game interaction with value decomposition.
Data Collection
Raw game data was collected by letting an existing model, DipNet [Paquette et al., NeurIPS 2019], interact with the Diplomacy environment, and the joint action value function $Q_i(s, a_i)$ was computed with the equilibrium search algorithm piKL-Hedge.
To adapt to the autoregressive decomposition strategy structure, the researchers further decomposed the joint action value into unit-level action values $Q_i^d(s, a_i^{1:d})$.
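As a sketch of what this data pipeline might look like in code, the snippet below flattens collected games into per-unit fine-tuning samples under the autoregressive decomposition; the data layout, helper names (`build_unit_samples`, `sample_order`), and prompt format are hypothetical and only meant to illustrate the structure.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

@dataclass
class UnitSample:
    prompt: str        # textual state + current unit + orders already chosen
    target_order: str  # order drawn from the unit-level target policy

def build_unit_samples(
    games: Sequence[Tuple[str, Sequence[str]]],          # (state_text, unit list)
    sample_order: Callable[[str, str, List[str]], str],  # draws a_i^d from pi_i^d
) -> List[UnitSample]:
    """Flatten collected games into per-unit supervised fine-tuning samples,
    one sample per (state, unit) pair under the autoregressive decomposition."""
    samples: List[UnitSample] = []
    for state_text, units in games:
        chosen: List[str] = []
        for unit in units:
            prev = "; ".join(chosen) if chosen else "none"
            prompt = (
                f"Game state:\n{state_text}\n"
                f"Orders already chosen this turn: {prev}\n"
                f"Give one order for unit {unit}:"
            )
            order = sample_order(state_text, unit, chosen)
            samples.append(UnitSample(prompt=prompt, target_order=order))
            chosen.append(order)
    return samples
```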