TY - JOUR
T1 - Unsupervised learning-enhanced neighborhood relationship utilization for intrinsic reward-driven multi-agent exploration
AU - Wang, Hao
AU - Wang, Zehan
AU - Yang, Liya
AU - Shi, Haobin
N1 - Publisher Copyright:
© 2026
PY - 2026/4/22
Y1 - 2026/4/22
AB - Multi-agent reinforcement learning (MARL) in sparse-reward environments often suffers from unreliable, uncoordinated exploration among neighboring agents. We propose Temporal Contrastive Distillation (TCD), a novel plug-and-play progressive mutual calibration architecture that establishes dynamic coordination signals for decentralized agents. Unlike conventional intrinsic reward distillation, TCD uses two modules for mutual calibration: (1) the Adaptive Attention Operator (AAO), an intrinsic reward distillation module with attention, detects emerging neighborhood-level coordination patterns; (2) the Attention Operator Evolver (AOE), driven by contrastive learning, achieves dual coordination via Contrastive Parameter Adaptation (CPA), which generates operator-updating signals, and Momentum-guided Progressive Transfer (MPT), which transfers these signals to guide AAO evolution. Through their interactions, TCD enables agents to recognize and leverage neighborhood relationships, mitigating the challenges posed by sparse rewards. Extensive experiments on StarCraft II (SMAC) and Google Research Football (GRF) show that TCD improves performance and sample efficiency over strong baselines, helping agents discover and refine complex coordination tactics, from micromanagement in SMAC to dynamic passing in GRF, highlighting TCD's broad applicability.
KW - Contrastive learning
KW - Intrinsic reward
KW - Multi-agent exploration
KW - Neighborhood coordination
KW - Sparse reward
UR - https://www.scopus.com/pages/publications/105030574887
U2 - 10.1016/j.knosys.2026.115540
DO - 10.1016/j.knosys.2026.115540
M3 - Article
AN - SCOPUS:105030574887
SN - 0950-7051
VL - 339
JO - Knowledge-Based Systems
JF - Knowledge-Based Systems
M1 - 115540
ER -