TY - JOUR
T1 - Self-Supervised Imitation for Offline Reinforcement Learning With Hindsight Relabeling
AU - Yu, Xudong
AU - Bai, Chenjia
AU - Wang, Changhong
AU - Yu, Dengxiu
AU - Chen, C. L.Philip
AU - Wang, Zhen
N1 - Publisher Copyright:
© 2013 IEEE.
PY - 2023/12/1
Y1 - 2023/12/1
AB - Reinforcement learning (RL) requires a large number of interactions with the environment, which are usually expensive or dangerous in real-world tasks. To address this problem, offline RL considers learning policies from fixed datasets, which is promising for utilizing large-scale datasets but still suffers from unstable value estimation for out-of-distribution data. Recent developments in RL via supervised learning offer an alternative for learning effective policies from suboptimal datasets, but they rely on oracle information from the environment. In this article, we present an offline RL algorithm that combines hindsight relabeling and supervised regression to predict actions without oracle information. We apply hindsight relabeling to the original dataset and learn a command generator and command-conditional policies in a supervised manner, where the command represents the desired return or goal location according to the corresponding task. Theoretically, we show that our method optimizes a lower bound of the goal-conditioned RL objective. Empirically, our method achieves competitive performance compared with existing approaches in the sparse-reward setting and favorable performance on continuous control tasks.
KW - Hindsight relabeling
KW - offline reinforcement learning (RL)
KW - supervised learning
UR - http://www.scopus.com/inward/record.url?scp=85168736234&partnerID=8YFLogxK
U2 - 10.1109/TSMC.2023.3297711
DO - 10.1109/TSMC.2023.3297711
M3 - Article
AN - SCOPUS:85168736234
SN - 2168-2216
VL - 53
SP - 7732
EP - 7743
JO - IEEE Transactions on Systems, Man, and Cybernetics: Systems
JF - IEEE Transactions on Systems, Man, and Cybernetics: Systems
IS - 12
ER -