@inproceedings{194633d34e3543929ac27e2ab50d3073,
title = "Advantage Policy Update Based on Proximal Policy Optimization",
abstract = "In this paper, a novel policy network update approach based on Proximal Policy Optimization (PPO), Advantageous Update Policy Proximal Policy Optimization (AUP-PPO), is proposed to alleviate the overfitting caused by using shared layers for the policy and value functions. Extending the sample-efficient reinforcement learning method PPO, which learns the policy and value functions with separate networks so that their optimization is decoupled, AUP-PPO uses the value function to calculate the advantage and updates the policy with the loss between the current and target advantage functions as a penalty term, instead of the value function. Evaluated on multiple benchmark control tasks in OpenAI Gym, AUP-PPO exhibits better generalization to the environment and achieves faster convergence and better robustness than the original PPO.",
keywords = "deep reinforcement learning, policy gradient, proximal policy optimization, reinforcement learning",
author = "Zilin Zeng and Junwei Wang and Zhigang Hu and Dongnan Su and Peng Shang",
note = "Publisher Copyright: {\textcopyright} 2023 SPIE.; 3rd International Seminar on Artificial Intelligence, Networking, and Information Technology, AINIT 2022; Conference date: 23-09-2022 through 25-09-2022",
year = "2023",
doi = "10.1117/12.2667235",
language = "English",
series = "Proceedings of SPIE - The International Society for Optical Engineering",
publisher = "SPIE",
editor = "Naijing Hu and Guanglin Zhang",
booktitle = "Third International Seminar on Artificial Intelligence, Networking, and Information Technology, AINIT 2022",
}