TY - JOUR
T1 - Diverse randomized value functions
T2 - A provably pessimistic approach for offline reinforcement learning
AU - Yu, Xudong
AU - Bai, Chenjia
AU - Guo, Hongyi
AU - Wang, Changhong
AU - Wang, Zhen
N1 - Publisher Copyright:
© 2024 Elsevier Inc.
PY - 2024/10
Y1 - 2024/10
N2 - Offline Reinforcement Learning (RL) faces challenges such as distributional shift and unreliable value estimation, especially for out-of-distribution (OOD) actions. To address these issues, existing uncertainty-based methods penalize the value function via uncertainty quantification but require numerous ensemble networks, leading to computational burdens and suboptimal outcomes. In this paper, we introduce a novel strategy that employs diverse randomized value functions to estimate the posterior distribution of Q-values. This approach provides robust uncertainty quantification and estimates the lower confidence bounds (LCB) of Q-values. By applying moderate value penalties for OOD actions, our method yields a provably pessimistic approach. We also emphasize diversity within the randomized value functions and improve efficiency by introducing a diversity regularization method, thereby reducing the requisite number of networks. These modules result in reliable value estimation and efficient policy learning from offline data. Theoretical analysis shows that our method recovers the provably efficient LCB penalty under linear MDP assumptions. Extensive empirical results demonstrate that our proposed method significantly outperforms baseline methods in terms of performance and parametric efficiency.
AB - Offline Reinforcement Learning (RL) faces challenges such as distributional shift and unreliable value estimation, especially for out-of-distribution (OOD) actions. To address these issues, existing uncertainty-based methods penalize the value function via uncertainty quantification but require numerous ensemble networks, leading to computational burdens and suboptimal outcomes. In this paper, we introduce a novel strategy that employs diverse randomized value functions to estimate the posterior distribution of Q-values. This approach provides robust uncertainty quantification and estimates the lower confidence bounds (LCB) of Q-values. By applying moderate value penalties for OOD actions, our method yields a provably pessimistic approach. We also emphasize diversity within the randomized value functions and improve efficiency by introducing a diversity regularization method, thereby reducing the requisite number of networks. These modules result in reliable value estimation and efficient policy learning from offline data. Theoretical analysis shows that our method recovers the provably efficient LCB penalty under linear MDP assumptions. Extensive empirical results demonstrate that our proposed method significantly outperforms baseline methods in terms of performance and parametric efficiency.
KW - Distributional shift
KW - Diversification
KW - Offline reinforcement learning
KW - Pessimism
KW - Randomized value functions
UR - http://www.scopus.com/inward/record.url?scp=85198035129&partnerID=8YFLogxK
U2 - 10.1016/j.ins.2024.121146
DO - 10.1016/j.ins.2024.121146
M3 - Article
AN - SCOPUS:85198035129
SN - 0020-0255
VL - 680
JO - Information Sciences
JF - Information Sciences
M1 - 121146
ER -