Advantage Policy Update Based on Proximal Policy Optimization

Zilin Zeng, Junwei Wang, Zhigang Hu, Dongnan Su, Peng Shang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Scopus citation

Abstract

In this paper, a novel policy network update approach based on Proximal Policy Optimization (PPO), Advantageous Update Policy Proximal Policy Optimization (AUP-PPO), is proposed to alleviate the over-fitting caused by sharing layers between the policy and value functions. Extending the sample-efficient PPO method, which uses separate networks to learn the policy and value functions so that their optimization is decoupled, AUP-PPO uses the value function to compute the advantage and updates the policy with the loss between the current and target advantage functions as a penalty term, rather than the value-function loss. Evaluated on multiple benchmark control tasks in OpenAI Gym, AUP-PPO generalizes better to the environment and achieves faster convergence and greater robustness than the original PPO.
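The abstract only sketches the loss construction, so the following is a minimal, hypothetical illustration of the idea rather than the paper's exact objective: a standard clipped PPO surrogate combined with a penalty on the discrepancy between current and target advantage estimates in place of the usual value-function loss term. The function name, the `penalty_coef` weight, and the way `target_advantages` is obtained (e.g., from a slow-moving target value network) are all assumptions made for the sake of the example.

```python
import torch


def aup_ppo_loss(log_probs, old_log_probs, advantages, target_advantages,
                 clip_eps=0.2, penalty_coef=0.5):
    """Sketch of a clipped PPO surrogate with an advantage-based penalty.

    Hypothetical illustration: the squared error between the current and
    target advantage estimates replaces the usual value-function loss term.
    """
    # Importance ratio between the current and old policies.
    ratio = torch.exp(log_probs - old_log_probs)

    # Standard PPO clipped surrogate objective (to be maximized).
    surrogate = torch.min(
        ratio * advantages,
        torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages,
    )

    # Penalty term: discrepancy between current and target advantage estimates.
    advantage_penalty = (advantages - target_advantages).pow(2).mean()

    # Minimize the negative surrogate plus the weighted penalty.
    return -surrogate.mean() + penalty_coef * advantage_penalty


# Example usage with dummy batch tensors.
if __name__ == "__main__":
    batch = 32
    log_probs = torch.randn(batch, requires_grad=True)
    old_log_probs = log_probs.detach() + 0.01 * torch.randn(batch)
    advantages = torch.randn(batch)
    target_advantages = advantages + 0.05 * torch.randn(batch)
    loss = aup_ppo_loss(log_probs, old_log_probs, advantages, target_advantages)
    print(loss.item())
```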

Original language: English
Title of host publication: Third International Seminar on Artificial Intelligence, Networking, and Information Technology, AINIT 2022
Editors: Naijing Hu, Guanglin Zhang
Publisher: SPIE
ISBN (Electronic): 9781510662964
DOIs
State: Published - 2023
Externally published: Yes
Event: 3rd International Seminar on Artificial Intelligence, Networking, and Information Technology, AINIT 2022 - Shanghai, China
Duration: 23 Sep 2022 – 25 Sep 2022

Publication series

Name: Proceedings of SPIE - The International Society for Optical Engineering
Volume: 12587
ISSN (Print): 0277-786X
ISSN (Electronic): 1996-756X

Conference

Conference: 3rd International Seminar on Artificial Intelligence, Networking, and Information Technology, AINIT 2022
Country/Territory: China
City: Shanghai
Period: 23/09/22 – 25/09/22

Keywords

  • deep reinforcement learning
  • policy gradient
  • proximal policy optimization
  • reinforcement learning
