Stable and Efficient Policy Evaluation

Daoming Lyu; Bo Liu; Matthieu Geist; Wen Dong; Saad Biaz; Qi Wang

doi:10.1109/TNNLS.2018.2871361

Institute of Optoelectronics and Intelligence (光电与智能研究院)

Research output: Contribution to journal › Article › Peer-reviewed

8 Citations (Scopus)

Abstract

Policy evaluation algorithms are essential to reinforcement learning due to their ability to predict the performance of a policy. However, there are two long-standing issues lying in this prediction problem that need to be tackled: off-policy stability and on-policy efficiency. The conventional temporal difference (TD) algorithm is known to perform very well in the on-policy setting, yet is not off-policy stable. On the other hand, the gradient TD and emphatic TD algorithms are off-policy stable, but are not on-policy efficient. This paper introduces novel algorithms that are both off-policy stable and on-policy efficient by using the oblique projection method. The empirical experimental results on various domains validate the effectiveness of the proposed approach.

Original language	English
Article number	8515047
Pages (from-to)	1831-1840
Number of pages	10
Journal	IEEE Transactions on Neural Networks and Learning Systems
Volume	30
Issue	6
DOI	https://doi.org/10.1109/TNNLS.2018.2871361
Publication status	Published - Jun 2019

Cite this

@article{f09112c26b8e4d82800c7909e833af49,

title = "Stable and Efficient Policy Evaluation",

abstract = "Policy evaluation algorithms are essential to reinforcement learning due to their ability to predict the performance of a policy. However, there are two long-standing issues lying in this prediction problem that need to be tackled: off-policy stability and on-policy efficiency. The conventional temporal difference (TD) algorithm is known to perform very well in the on-policy setting, yet is not off-policy stable. On the other hand, the gradient TD and emphatic TD algorithms are off-policy stable, but are not on-policy efficient. This paper introduces novel algorithms that are both off-policy stable and on-policy efficient by using the oblique projection method. The empirical experimental results on various domains validate the effectiveness of the proposed approach.",

keywords = "Off-policy, policy evaluation, reinforcement learning (RL), temporal difference (TD) learning",

author = "Daoming Lyu and Bo Liu and Matthieu Geist and Wen Dong and Saad Biaz and Qi Wang",

note = "Publisher Copyright: {\textcopyright} 2018 IEEE.",

year = "2019",

month = jun,

doi = "10.1109/TNNLS.2018.2871361",

language = "English",

volume = "30",

pages = "1831--1840",

journal = "IEEE Transactions on Neural Networks and Learning Systems",

issn = "2162-237X",

publisher = "IEEE Computational Intelligence Society",

number = "6",

}

TY - JOUR

T1 - Stable and Efficient Policy Evaluation

AU - Lyu, Daoming

AU - Liu, Bo

AU - Geist, Matthieu

AU - Dong, Wen

AU - Biaz, Saad

AU - Wang, Qi

PY - 2019/6

Y1 - 2019/6

N2 - Policy evaluation algorithms are essential to reinforcement learning due to their ability to predict the performance of a policy. However, there are two long-standing issues lying in this prediction problem that need to be tackled: off-policy stability and on-policy efficiency. The conventional temporal difference (TD) algorithm is known to perform very well in the on-policy setting, yet is not off-policy stable. On the other hand, the gradient TD and emphatic TD algorithms are off-policy stable, but are not on-policy efficient. This paper introduces novel algorithms that are both off-policy stable and on-policy efficient by using the oblique projection method. The empirical experimental results on various domains validate the effectiveness of the proposed approach.

AB - Policy evaluation algorithms are essential to reinforcement learning due to their ability to predict the performance of a policy. However, there are two long-standing issues lying in this prediction problem that need to be tackled: off-policy stability and on-policy efficiency. The conventional temporal difference (TD) algorithm is known to perform very well in the on-policy setting, yet is not off-policy stable. On the other hand, the gradient TD and emphatic TD algorithms are off-policy stable, but are not on-policy efficient. This paper introduces novel algorithms that are both off-policy stable and on-policy efficient by using the oblique projection method. The empirical experimental results on various domains validate the effectiveness of the proposed approach.

KW - Off-policy

KW - policy evaluation

KW - reinforcement learning (RL)

KW - temporal difference (TD) learning

UR - http://www.scopus.com/inward/record.url?scp=85055871049&partnerID=8YFLogxK

U2 - 10.1109/TNNLS.2018.2871361

DO - 10.1109/TNNLS.2018.2871361

M3 - Article

C2 - 30387743

AN - SCOPUS:85055871049

SN - 2162-237X

VL - 30

SP - 1831

EP - 1840

JO - IEEE Transactions on Neural Networks and Learning Systems

JF - IEEE Transactions on Neural Networks and Learning Systems

IS - 6

M1 - 8515047

ER -
