TY - JOUR
T1 - Self-Supervised Imitation for Offline Reinforcement Learning With Hindsight Relabeling
AU - Yu, Xudong
AU - Bai, Chenjia
AU - Wang, Changhong
AU - Yu, Dengxiu
AU - Chen, C. L.Philip
AU - Wang, Zhen
N1 - Publisher Copyright:
© 2013 IEEE.
PY - 2023/12/1
Y1 - 2023/12/1
AB - Reinforcement learning (RL) requires a large number of interactions with the environment, which are usually expensive or dangerous in real-world tasks. To address this problem, offline RL considers learning policies from fixed datasets, which is promising for utilizing large-scale datasets but still suffers from unstable value estimation for out-of-distribution data. Recent developments in RL via supervised learning offer an alternative for learning effective policies from suboptimal datasets, but they rely on oracle information from the environment. In this article, we present an offline RL algorithm that combines hindsight relabeling and supervised regression to predict actions without oracle information. We apply hindsight relabeling to the original dataset and learn a command generator and command-conditional policies in a supervised manner, where the command represents the desired return or goal location according to the corresponding task. Theoretically, we show that our method optimizes a lower bound of the goal-conditioned RL objective. Empirically, our method achieves competitive performance compared with existing approaches in the sparse-reward setting and favorable performance on continuous control tasks.
KW - Hindsight relabeling
KW - offline reinforcement learning (RL)
KW - supervised learning
UR - http://www.scopus.com/inward/record.url?scp=85168736234&partnerID=8YFLogxK
U2 - 10.1109/TSMC.2023.3297711
DO - 10.1109/TSMC.2023.3297711
M3 - Article
AN - SCOPUS:85168736234
SN - 2168-2216
VL - 53
SP - 7732
EP - 7743
JO - IEEE Transactions on Systems, Man, and Cybernetics: Systems
JF - IEEE Transactions on Systems, Man, and Cybernetics: Systems
IS - 12
ER -