TY - JOUR
T1 - A preference-based Reinforcement Learning method of maneuver decision-making in air combat
AU - Zhang, An
AU - Mao, Zeming
AU - Zhang, Mengqi
AU - Bi, Wenhao
N1 - Publisher Copyright:
Copyright © 2026. Published by Elsevier Ltd.
PY - 2026/3/1
Y1 - 2026/3/1
N2 - Reinforcement Learning (RL) techniques have advanced significantly in addressing maneuver decision-making problems in air combat. However, existing RL methods with fixed reward structures suffer from inconsistent preferences between dense and sparse rewards, hindering efficient learning of the optimal policy. To overcome this challenge, a preference-based reinforcement learning method for maneuver decision-making in air combat is proposed. First, a Preference-Based Adaptive Reward Weights Generation (PBARWG) model is proposed to adaptively generate the weights of the dense rewards. This model formulates preference relationships by comparing the discounted cumulative sparse rewards across different processes, and aligns the preferences between the dense and sparse rewards by minimizing a preference loss function. Then, to capture the temporal features of air combat, an improved Multi-Agent Proximal Policy Optimization (MAPPO) model with a Gated Recurrent Unit (GRU) and residual structure, designated MAPPO-GRU-PBARWG, is proposed to obtain an effective maneuver policy. Finally, comparative experiments demonstrate that the proposed method outperforms other methods, achieving a win rate of more than 50%, an extremely low crash rate, and a higher average reward. This study highlights the effectiveness of adaptive weight generation and efficient temporal feature extraction in producing air combat strategies, and provides a viable approach for autonomous maneuver decision-making in short-range air combat scenarios.
AB - Reinforcement Learning (RL) techniques have advanced significantly in addressing maneuver decision-making problems in air combat. However, existing RL methods with fixed reward structures suffer from inconsistent preferences between dense and sparse rewards, hindering efficient learning of the optimal policy. To overcome this challenge, a preference-based reinforcement learning method for maneuver decision-making in air combat is proposed. First, a Preference-Based Adaptive Reward Weights Generation (PBARWG) model is proposed to adaptively generate the weights of the dense rewards. This model formulates preference relationships by comparing the discounted cumulative sparse rewards across different processes, and aligns the preferences between the dense and sparse rewards by minimizing a preference loss function. Then, to capture the temporal features of air combat, an improved Multi-Agent Proximal Policy Optimization (MAPPO) model with a Gated Recurrent Unit (GRU) and residual structure, designated MAPPO-GRU-PBARWG, is proposed to obtain an effective maneuver policy. Finally, comparative experiments demonstrate that the proposed method outperforms other methods, achieving a win rate of more than 50%, an extremely low crash rate, and a higher average reward. This study highlights the effectiveness of adaptive weight generation and efficient temporal feature extraction in producing air combat strategies, and provides a viable approach for autonomous maneuver decision-making in short-range air combat scenarios.
KW - Air combat
KW - Maneuver decision-making
KW - Preference learning
KW - Reinforcement learning
UR - https://www.scopus.com/pages/publications/105027533986
U2 - 10.1016/j.engappai.2026.113761
DO - 10.1016/j.engappai.2026.113761
M3 - Article
AN - SCOPUS:105027533986
SN - 0952-1976
VL - 167
JO - Engineering Applications of Artificial Intelligence
JF - Engineering Applications of Artificial Intelligence
M1 - 113761
ER -