Stable and Efficient Policy Evaluation

Daoming Lyu; Bo Liu; Matthieu Geist; Wen Dong; Saad Biaz; Qi Wang

doi:10.1109/TNNLS.2018.2871361

School of Artificial Intelligence, OPtics and Electronics

Research output: Contribution to journal › Article › peer-review

8 Scopus citations

Abstract

Policy evaluation algorithms are essential to reinforcement learning due to their ability to predict the performance of a policy. However, there are two long-standing issues lying in this prediction problem that need to be tackled: off-policy stability and on-policy efficiency. The conventional temporal difference (TD) algorithm is known to perform very well in the on-policy setting, yet is not off-policy stable. On the other hand, the gradient TD and emphatic TD algorithms are off-policy stable, but are not on-policy efficient. This paper introduces novel algorithms that are both off-policy stable and on-policy efficient by using the oblique projection method. The empirical experimental results on various domains validate the effectiveness of the proposed approach.

Original language	English
Article number	8515047
Pages (from-to)	1831-1840
Number of pages	10
Journal	IEEE Transactions on Neural Networks and Learning Systems
Volume	30
Issue number	6
DOIs	https://doi.org/10.1109/TNNLS.2018.2871361
State	Published - Jun 2019

Keywords

Off-policy
policy evaluation
reinforcement learning (RL)
temporal difference (TD) learning

Access to Document

10.1109/TNNLS.2018.2871361

Cite this

@article{f09112c26b8e4d82800c7909e833af49,

title = "Stable and Efficient Policy Evaluation",

abstract = "Policy evaluation algorithms are essential to reinforcement learning due to their ability to predict the performance of a policy. However, there are two long-standing issues lying in this prediction problem that need to be tackled: off-policy stability and on-policy efficiency. The conventional temporal difference (TD) algorithm is known to perform very well in the on-policy setting, yet is not off-policy stable. On the other hand, the gradient TD and emphatic TD algorithms are off-policy stable, but are not on-policy efficient. This paper introduces novel algorithms that are both off-policy stable and on-policy efficient by using the oblique projection method. The empirical experimental results on various domains validate the effectiveness of the proposed approach.",

keywords = "Off-policy, policy evaluation, reinforcement learning (RL), temporal difference (TD) learning",

author = "Daoming Lyu and Bo Liu and Matthieu Geist and Wen Dong and Saad Biaz and Qi Wang",

note = "Publisher Copyright: {\textcopyright} 2018 IEEE.",

year = "2019",

month = jun,

doi = "10.1109/TNNLS.2018.2871361",

language = "English",

volume = "30",

pages = "1831--1840",

journal = "IEEE Transactions on Neural Networks and Learning Systems",

issn = "2162-237X",

publisher = "IEEE Computational Intelligence Society",

number = "6",

}

TY - JOUR

T1 - Stable and Efficient Policy Evaluation

AU - Lyu, Daoming

AU - Liu, Bo

AU - Geist, Matthieu

AU - Dong, Wen

AU - Biaz, Saad

AU - Wang, Qi

PY - 2019/6

Y1 - 2019/6

N2 - Policy evaluation algorithms are essential to reinforcement learning due to their ability to predict the performance of a policy. However, there are two long-standing issues lying in this prediction problem that need to be tackled: off-policy stability and on-policy efficiency. The conventional temporal difference (TD) algorithm is known to perform very well in the on-policy setting, yet is not off-policy stable. On the other hand, the gradient TD and emphatic TD algorithms are off-policy stable, but are not on-policy efficient. This paper introduces novel algorithms that are both off-policy stable and on-policy efficient by using the oblique projection method. The empirical experimental results on various domains validate the effectiveness of the proposed approach.

AB - Policy evaluation algorithms are essential to reinforcement learning due to their ability to predict the performance of a policy. However, there are two long-standing issues lying in this prediction problem that need to be tackled: off-policy stability and on-policy efficiency. The conventional temporal difference (TD) algorithm is known to perform very well in the on-policy setting, yet is not off-policy stable. On the other hand, the gradient TD and emphatic TD algorithms are off-policy stable, but are not on-policy efficient. This paper introduces novel algorithms that are both off-policy stable and on-policy efficient by using the oblique projection method. The empirical experimental results on various domains validate the effectiveness of the proposed approach.

KW - Off-policy

KW - policy evaluation

KW - reinforcement learning (RL)

KW - temporal difference (TD) learning

UR - http://www.scopus.com/inward/record.url?scp=85055871049&partnerID=8YFLogxK

U2 - 10.1109/TNNLS.2018.2871361

DO - 10.1109/TNNLS.2018.2871361

M3 - Article

C2 - 30387743

AN - SCOPUS:85055871049

SN - 2162-237X

VL - 30

SP - 1831

EP - 1840

JO - IEEE Transactions on Neural Networks and Learning Systems

JF - IEEE Transactions on Neural Networks and Learning Systems

IS - 6

M1 - 8515047

ER -
