TY - JOUR
T1 - Cournot Policy Model
T2 - Rethinking centralized training in multi-agent reinforcement learning
AU - Li, Jingchen
AU - Yang, Yusen
AU - He, Ziming
AU - Wu, Huarui
AU - Shi, Haobin
AU - Chen, Wenbai
N1 - Publisher Copyright:
© 2024 Elsevier Inc.
PY - 2024/8
Y1 - 2024/8
N2 - This work studies Centralized Training and Decentralized Execution (CTDE), a powerful mechanism for easing multi-agent reinforcement learning. Although centralized evaluation ensures unbiased estimates of the Q-value, peers with unknown policies drive the decentralized policies far from expectation. To achieve a more stable and effective joint policy, we develop a novel game framework, termed the Cournot Policy Model, to enhance CTDE-based multi-agent learning. Combining game theory and reinforcement learning, we regard joint decision-making in a single time step as a Cournot duopoly model, and design a Hetero Variational Auto-Encoder to model the policies of peers during decentralized execution. With a conditional policy, each agent is guided to a stable mixed-strategy equilibrium even as the joint policy evolves over time. We further demonstrate that such an equilibrium must exist under centralized evaluation. We investigate the improvement our method brings to existing centralized learning methods. Experimental results on a comprehensive collection of benchmarks indicate that our approach consistently outperforms baseline methods.
AB - This work studies Centralized Training and Decentralized Execution (CTDE), a powerful mechanism for easing multi-agent reinforcement learning. Although centralized evaluation ensures unbiased estimates of the Q-value, peers with unknown policies drive the decentralized policies far from expectation. To achieve a more stable and effective joint policy, we develop a novel game framework, termed the Cournot Policy Model, to enhance CTDE-based multi-agent learning. Combining game theory and reinforcement learning, we regard joint decision-making in a single time step as a Cournot duopoly model, and design a Hetero Variational Auto-Encoder to model the policies of peers during decentralized execution. With a conditional policy, each agent is guided to a stable mixed-strategy equilibrium even as the joint policy evolves over time. We further demonstrate that such an equilibrium must exist under centralized evaluation. We investigate the improvement our method brings to existing centralized learning methods. Experimental results on a comprehensive collection of benchmarks indicate that our approach consistently outperforms baseline methods.
KW - Machine learning
KW - Multi-agent reinforcement learning
KW - Multi-agent system
UR - http://www.scopus.com/inward/record.url?scp=85195701321&partnerID=8YFLogxK
U2 - 10.1016/j.ins.2024.120983
DO - 10.1016/j.ins.2024.120983
M3 - Article
AN - SCOPUS:85195701321
SN - 0020-0255
VL - 677
JO - Information Sciences
JF - Information Sciences
M1 - 120983
ER -