TY - JOUR
T1 - An Index Policy Based on Sarsa and Q-Learning for Heterogeneous Smart Target Tracking
AU - Hao, Yuhang
AU - Wang, Zengfu
AU - Fu, Jing
AU - Pan, Quan
AU - Yun, Tao
N1 - Publisher Copyright:
© 2001-2012 IEEE.
PY - 2024
Y1 - 2024
AB - In nonmyopic radar scheduling for tracking multiple smart targets within an active and passive radar network (APRN), both the enhanced short-term tracking performance and the higher probability of future target maneuvering induced by active tracking must be considered. Optimizing long-term tracking performance suffers from the curse of dimensionality, and optimal solutions are in general intractable. Meanwhile, the unknown dynamic mode transitions of smart targets complicate the beam scheduling problem. This article models the problem as a Markov decision process (MDP) consisting of parallel restless bandit processes. Each bandit process is associated with a smart target, whose mode states are defined by its dynamic modes. The mode state evolves according to different dynamic model transitions under different actions, namely, whether or not the target is being actively tracked. For unknown state transition matrices, this article proposes a new method that utilizes forward state-action-reward-state-action (Sarsa) learning and backward Q-learning (QL) to approximate the indices by adapting the state-action value functions, or equivalently the Q-functions. The scheduling policy follows the indices, which are real numbers representing the marginal rewards of taking different actions. A new policy, namely, the index policy based on Sarsa and Q-learning (ISQ), is proposed to maximize the long-term tracking rewards. Numerical results demonstrate that the proposed ISQ policy outperforms conventional QL-based methods and the deep Q-network (DQN) algorithm, and rapidly converges to the well-known Whittle index policy with revealed state transition models, which serves as the benchmark.
KW - Index policy
KW - Q-learning (QL)
KW - radar scheduling
KW - state-action-reward-state-action (Sarsa)
KW - target tracking
UR - http://www.scopus.com/inward/record.url?scp=85205434179&partnerID=8YFLogxK
U2 - 10.1109/JSEN.2024.3461722
DO - 10.1109/JSEN.2024.3461722
M3 - Article
AN - SCOPUS:85205434179
SN - 1530-437X
VL - 24
SP - 36127
EP - 36142
JO - IEEE Sensors Journal
JF - IEEE Sensors Journal
IS - 21
ER -
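
Note: as a rough illustration of the index-policy idea summarized in the abstract (not the authors' exact ISQ algorithm), the following Python sketch assumes a tabular setting with one Q-table per target, epsilon-greedy exploration, and an index defined as the Q-value gap between the active and passive actions; the names IndexLearner and schedule, and the specific update rules, are hypothetical.

import numpy as np

class IndexLearner:
    """Per-target learner: Q-table over (mode state, action).

    Actions: 0 = passive tracking, 1 = active tracking.
    The index of a state is read off as the marginal reward of
    choosing the active action over the passive one (an assumption
    for this sketch, in the spirit of Whittle-index approximation).
    """

    def __init__(self, n_states, alpha=0.1, gamma=0.95, eps=0.1):
        self.Q = np.zeros((n_states, 2))
        self.alpha, self.gamma, self.eps = alpha, gamma, eps

    def act(self, s, rng):
        # epsilon-greedy exploration over this target's own Q-values
        if rng.random() < self.eps:
            return int(rng.integers(2))
        return int(np.argmax(self.Q[s]))

    def sarsa_update(self, s, a, r, s_next, a_next):
        # forward on-policy (Sarsa) temporal-difference target
        td = r + self.gamma * self.Q[s_next, a_next] - self.Q[s, a]
        self.Q[s, a] += self.alpha * td

    def q_update(self, s, a, r, s_next):
        # off-policy (Q-learning) temporal-difference target
        td = r + self.gamma * self.Q[s_next].max() - self.Q[s, a]
        self.Q[s, a] += self.alpha * td

    def index(self, s):
        # marginal reward of active vs. passive tracking in state s
        return self.Q[s, 1] - self.Q[s, 0]

def schedule(learners, states, m):
    """Activate the m targets with the largest current indices."""
    idx = [lrn.index(s) for lrn, s in zip(learners, states)]
    return np.argsort(idx)[-m:]

# Usage sketch: five targets with four mode states each, two active beams.
# rng = np.random.default_rng(0)
# learners = [IndexLearner(n_states=4) for _ in range(5)]
# active = schedule(learners, states=[0, 1, 2, 3, 0], m=2)

The design choice this sketch reflects is that each bandit process (target) learns independently, so the joint scheduling problem decomposes: the scheduler only compares scalar indices across targets rather than searching the exponentially large joint state space.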