TY - JOUR
T1 - An Index Policy Based on Sarsa and Q-Learning for Heterogeneous Smart Target Tracking
AU - Hao, Yuhang
AU - Wang, Zengfu
AU - Fu, Jing
AU - Pan, Quan
AU - Yun, Tao
N1 - Publisher Copyright:
© 2001-2012 IEEE.
PY - 2024
Y1 - 2024
AB - In nonmyopic radar scheduling for tracking multiple smart targets within an active and passive radar network (APRN), both the enhanced short-term tracking performance and the higher probability of future target maneuvering induced by active tracking must be considered. Optimizing long-term tracking performance suffers from the curse of dimensionality, and optimal solutions are in general intractable. Meanwhile, the unknown dynamic mode transitions of smart targets complicate the beam scheduling problem. This article models the problem as a Markov decision process (MDP) consisting of parallel restless bandit processes. Each bandit process is associated with a smart target, whose mode states are defined by its dynamic modes. The mode state evolves according to different dynamic model transitions under different actions, namely, whether or not the target is being actively tracked. For unknown state transition matrices, this article proposes a new method that utilizes forward state-action-reward-state-action (Sarsa) learning and backward Q-learning (QL) to approximate the indices by adapting the state-action value functions, or equivalently the Q-functions. The scheduling policy follows the indices, which are real numbers representing the marginal rewards of taking different actions. A new policy, namely, the index policy based on Sarsa and Q-learning (ISQ), is proposed to maximize the long-term tracking rewards. Numerical results demonstrate that the proposed ISQ policy outperforms conventional QL-based methods and the deep Q-network (DQN) algorithm, and rapidly converges to the well-known Whittle index policy with revealed state transition models, which serves as the benchmark.
KW - Index policy
KW - Q-learning (QL)
KW - radar scheduling
KW - state-action-reward-state-action (Sarsa)
KW - target tracking
UR - http://www.scopus.com/inward/record.url?scp=85205434179&partnerID=8YFLogxK
U2 - 10.1109/JSEN.2024.3461722
DO - 10.1109/JSEN.2024.3461722
M3 - Article
AN - SCOPUS:85205434179
SN - 1530-437X
VL - 24
SP - 36127
EP - 36142
JO - IEEE Sensors Journal
JF - IEEE Sensors Journal
IS - 21
ER -
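
Note: as a rough illustration of the index-policy idea summarized in the abstract (not the authors' exact ISQ algorithm), the following Python sketch assumes a tabular setting with one Q-table per target, epsilon-greedy exploration, and an index defined as the Q-value gap between the active and passive actions; the names IndexLearner and schedule, and the specific update rules, are hypothetical.

import numpy as np

class IndexLearner:
    """Per-target learner: Q-table over (mode state, action).

    Actions: 0 = passive tracking, 1 = active tracking.
    The index of a state is read off as the marginal reward of
    choosing the active action over the passive one (an assumption
    for this sketch, in the spirit of Whittle-index approximation).
    """

    def __init__(self, n_states, alpha=0.1, gamma=0.95, eps=0.1):
        self.Q = np.zeros((n_states, 2))
        self.alpha, self.gamma, self.eps = alpha, gamma, eps

    def act(self, s, rng):
        # epsilon-greedy exploration over this target's own Q-values
        if rng.random() < self.eps:
            return int(rng.integers(2))
        return int(np.argmax(self.Q[s]))

    def sarsa_update(self, s, a, r, s_next, a_next):
        # forward on-policy (Sarsa) temporal-difference target
        td = r + self.gamma * self.Q[s_next, a_next] - self.Q[s, a]
        self.Q[s, a] += self.alpha * td

    def q_update(self, s, a, r, s_next):
        # off-policy (Q-learning) temporal-difference target
        td = r + self.gamma * self.Q[s_next].max() - self.Q[s, a]
        self.Q[s, a] += self.alpha * td

    def index(self, s):
        # marginal reward of active vs. passive tracking in state s
        return self.Q[s, 1] - self.Q[s, 0]

def schedule(learners, states, m):
    """Activate the m targets with the largest current indices."""
    idx = [lrn.index(s) for lrn, s in zip(learners, states)]
    return np.argsort(idx)[-m:]

# Usage sketch: five targets with four mode states each, two active beams.
# rng = np.random.default_rng(0)
# learners = [IndexLearner(n_states=4) for _ in range(5)]
# active = schedule(learners, states=[0, 1, 2, 3, 0], m=2)

The design choice this sketch reflects is that each bandit process (target) learns independently, so the joint scheduling problem decomposes: the scheduler only compares scalar indices across targets rather than searching the exponentially large joint state space.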