TY - GEN
T1 - Model-Based Offline Adaptive Policy Optimization with Episodic Memory
AU - Cao, Hongye
AU - Wei, Qianru
AU - Zheng, Jiangbin
AU - Shi, Yanqing
N1 - Publisher Copyright:
© 2022, The Author(s), under exclusive license to Springer Nature Switzerland AG.
PY - 2022
Y1 - 2022
N2 - Offline reinforcement learning (RL) is a promising direction for applying RL to real-world problems because it avoids expensive and dangerous online exploration. However, offline RL is challenging due to extrapolation errors caused by the distribution shift between the offline dataset and the states visited by the learned policy. Existing model-based offline RL methods place pessimistic constraints on the learned model within the support region of the offline data to avoid extrapolation errors, but these approaches limit the generalization potential of the policy in the out-of-distribution (OOD) region. Moreover, the manually fixed uncertainty estimation and the sparse rewards of low-quality datasets in existing methods adapt poorly to different learning tasks. Hence, a model-based offline adaptive policy optimization with episodic memory is proposed in this work to improve the generalization of the policy. Inspired by active learning, a constraint strength is proposed to trade off return and risk adaptively, balancing the robustness and generalization ability of the policy. Further, episodic memory is applied to capture successful experience and improve adaptability. Extensive experiments on D4RL datasets demonstrate that the proposed method outperforms existing state-of-the-art methods and achieves superior performance on challenging tasks that require OOD generalization.
AB - Offline reinforcement learning (RL) is a promising direction for applying RL to real-world problems because it avoids expensive and dangerous online exploration. However, offline RL is challenging due to extrapolation errors caused by the distribution shift between the offline dataset and the states visited by the learned policy. Existing model-based offline RL methods place pessimistic constraints on the learned model within the support region of the offline data to avoid extrapolation errors, but these approaches limit the generalization potential of the policy in the out-of-distribution (OOD) region. Moreover, the manually fixed uncertainty estimation and the sparse rewards of low-quality datasets in existing methods adapt poorly to different learning tasks. Hence, a model-based offline adaptive policy optimization with episodic memory is proposed in this work to improve the generalization of the policy. Inspired by active learning, a constraint strength is proposed to trade off return and risk adaptively, balancing the robustness and generalization ability of the policy. Further, episodic memory is applied to capture successful experience and improve adaptability. Extensive experiments on D4RL datasets demonstrate that the proposed method outperforms existing state-of-the-art methods and achieves superior performance on challenging tasks that require OOD generalization.
KW - Constraint strength
KW - Episodic memory
KW - Offline reinforcement learning
UR - http://www.scopus.com/inward/record.url?scp=85138690595&partnerID=8YFLogxK
U2 - 10.1007/978-3-031-15931-2_5
DO - 10.1007/978-3-031-15931-2_5
M3 - Conference contribution
AN - SCOPUS:85138690595
SN - 9783031159305
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 50
EP - 62
BT - Artificial Neural Networks and Machine Learning - ICANN 2022 - 31st International Conference on Artificial Neural Networks, Proceedings
A2 - Pimenidis, Elias
A2 - Aydin, Mehmet
A2 - Angelov, Plamen
A2 - Jayne, Chrisina
A2 - Papaleonidas, Antonios
PB - Springer Science and Business Media Deutschland GmbH
T2 - 31st International Conference on Artificial Neural Networks, ICANN 2022
Y2 - 6 September 2022 through 9 September 2022
ER -