Self-Supervised Imitation for Offline Reinforcement Learning With Hindsight Relabeling

  • Harbin Institute of Technology
  • Shanghai Artificial Intelligence Laboratory
  • South China University of Technology
  • University of Macau

Research output: Contribution to journal › Article › peer-review

6 Scopus citations

Abstract

Reinforcement learning (RL) requires many interactions with the environment, which are often expensive or dangerous in real-world tasks. To address this problem, offline RL learns policies from fixed datasets, which makes it promising for exploiting large-scale datasets, but it still suffers from unstable estimation on out-of-distribution data. Recent RL-via-supervised-learning methods offer an alternative for learning effective policies from suboptimal datasets, but they rely on oracle information from the environment. In this article, we present an offline RL algorithm that combines hindsight relabeling and supervised regression to predict actions without oracle information. We apply hindsight relabeling to the original dataset and learn a command generator and command-conditional policies in a supervised manner, where the command represents the desired return or the goal location, depending on the task. Theoretically, we show that our method optimizes a lower bound of the goal-conditional RL objective. Empirically, our method achieves competitive performance compared with existing approaches in the sparse-reward setting and favorable performance in continuous control tasks.
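The hindsight relabeling step mentioned in the abstract can be illustrated with a minimal sketch. This is not the article's implementation: the function names, the undiscounted return-to-go, and the uniform sampling of a future state as the goal are all assumptions made here for illustration. The idea shown is that each logged transition is paired with a command that the trajectory actually achieved, so that a command-conditional policy can later be fit by ordinary supervised regression on (state, command) -> action.

```python
import numpy as np


def relabel_with_returns(states, actions, rewards):
    """Relabel one trajectory with its achieved return-to-go (assumed undiscounted).

    Returns (state, command, action) tuples, where the command at step t is
    the sum of rewards actually collected from step t to the end.
    """
    rewards = np.asarray(rewards, dtype=float)
    # Reverse cumulative sum: returns_to_go[t] = rewards[t] + rewards[t+1] + ...
    returns_to_go = np.cumsum(rewards[::-1])[::-1]
    return [(s, g, a) for s, g, a in zip(states, returns_to_go, actions)]


def relabel_with_goals(states, actions, rng):
    """Relabel one trajectory with goals it actually reached.

    Assumes len(states) == len(actions) + 1 (the final state is included).
    The command at step t is a state sampled uniformly from steps t+1..T.
    """
    samples = []
    for t in range(len(actions)):
        future = rng.integers(t + 1, len(states))  # upper bound is exclusive
        samples.append((states[t], states[future], actions[t]))
    return samples


if __name__ == "__main__":
    # Toy trajectory; the numbers are made up for illustration only.
    demo = relabel_with_returns(
        states=[0, 1, 2], actions=[10, 11, 12], rewards=[1.0, 0.0, 2.0]
    )
    for state, command, action in demo:
        print(state, command, action)
```

The relabeled tuples form a plain supervised dataset. Which command type is used (return or goal location) would depend on the task, as the abstract notes.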

Original language: English
Pages (from-to): 7732-7743
Number of pages: 12
Journal: IEEE Transactions on Systems, Man, and Cybernetics: Systems
Volume: 53
Issue number: 12
DOIs
State: Published - 1 Dec 2023

Keywords

  • Hindsight relabeling
  • offline reinforcement learning (RL)
  • supervised learning
