Behavior fusion for deep reinforcement learning

Haobin Shi, Meng Xu, Kao-Shing Hwang, Bo-Yin Cai

Research output: Contribution to journal › Article › peer-review

Abstract

For a deep reinforcement learning (DRL) system, it is difficult to design a reward function for a complex task, so this paper proposes a behavior-fusion framework for the actor–critic architecture that learns the policy from an advantage function composed of two value functions. First, the proposed method decomposes a complex task into several sub-tasks and merges the policies trained for those sub-tasks into a unified policy for the complex task, instead of designing a new reward function and training a new policy from scratch. Each sub-task is trained individually by an actor–critic algorithm using a simple reward function, and these pre-trained sub-task policies serve as building blocks for rapidly assembling a prototype of the complicated task. Second, the proposed method integrates the modules into the policy-gradient calculation by accumulating their returns, which reduces the variance of the gradient estimate. Third, two alternative methods for acquiring the integrated returns of the complicated task are proposed. The Atari 2600 Pong game and a wafer-probing task are used to validate the performance of the proposed methods against a method using a gate network.
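
As a concrete illustration of the fusion step described above, the sketch below sums the temporal-difference advantages of two independently trained sub-task critics and uses the result to weight the policy gradient of a shared actor. This is a minimal sketch in PyTorch, not the paper's code: the network shapes, the names Actor, SubCritic, and fused_advantage, and the equal-weight sum over sub-task advantages are all illustrative assumptions, and neither of the paper's two return-integration methods is reproduced here.

    # Minimal behavior-fusion sketch: two pre-trained sub-task critics
    # drive one shared actor. Names and shapes are illustrative assumptions.
    import torch
    import torch.nn as nn

    class Actor(nn.Module):
        """Shared policy network producing a categorical action distribution."""
        def __init__(self, obs_dim, n_actions):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                     nn.Linear(64, n_actions))

        def forward(self, obs):
            return torch.distributions.Categorical(logits=self.net(obs))

    class SubCritic(nn.Module):
        """Value function trained beforehand on one sub-task's simple reward."""
        def __init__(self, obs_dim):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                     nn.Linear(64, 1))

        def forward(self, obs):
            return self.net(obs).squeeze(-1)

    def fused_advantage(critics, rewards, obs, next_obs, gamma=0.99):
        """Equal-weight fusion: A(s) = sum_i [ r_i + gamma * V_i(s') - V_i(s) ]."""
        adv = torch.zeros(obs.shape[0])
        with torch.no_grad():
            for critic, r in zip(critics, rewards):
                adv += r + gamma * critic(next_obs) - critic(obs)
        return adv

    # One policy-gradient update on a toy batch.
    obs_dim, n_actions, batch = 8, 4, 32
    actor = Actor(obs_dim, n_actions)
    critics = [SubCritic(obs_dim), SubCritic(obs_dim)]  # stand-ins for pre-trained critics
    opt = torch.optim.Adam(actor.parameters(), lr=3e-4)

    obs = torch.randn(batch, obs_dim)
    next_obs = torch.randn(batch, obs_dim)
    rewards = [torch.randn(batch), torch.randn(batch)]  # one reward signal per sub-task

    dist = actor(obs)
    actions = dist.sample()
    adv = fused_advantage(critics, rewards, obs, next_obs)
    loss = -(dist.log_prob(actions) * adv).mean()       # REINFORCE-style actor loss
    opt.zero_grad()
    loss.backward()
    opt.step()

A weighted sum, or either of the paper's integrated-return schemes, would slot in at fused_advantage without changing the actor update.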

Original language: English
Pages (from-to): 434-444
Number of pages: 11
Journal: ISA Transactions
Volume: 98
State: Published - Mar 2020

Keywords

  • Actor–critic
  • Behavior fusion
  • Complex task
  • Deep reinforcement learning
  • Policy gradient
