TY - JOUR
T1 - Eliminating Primacy Bias in Online Reinforcement Learning by Self-Distillation
AU - Li, Jingchen
AU - Shi, Haobin
AU - Wu, Huarui
AU - Zhao, Chunjiang
AU - Hwang, Kao Shing
N1 - Publisher Copyright:
IEEE
PY - 2024
Y1 - 2024
N2 - Excessive invalid explorations at the beginning of training put the deep reinforcement learning process at risk of overfitting, further resulting in spurious decisions that obstruct agents in subsequent states and explorations. This phenomenon is termed primacy bias in online reinforcement learning. This work systematically investigates primacy bias in online reinforcement learning, discussing its causes and analyzing its characteristics. In addition, to learn a policy that generalizes to subsequent states and explorations, we develop an online reinforcement learning framework, termed self-distillation reinforcement learning (SDRL), based on knowledge distillation: the agent transfers its learned knowledge into a randomly initialized policy at regular intervals, and the new policy network replaces the original one in subsequent training. The core idea of this work is that distilling knowledge from the trained policy into another policy can filter out biases, producing a more generalized policy during learning. Moreover, to prevent the new policy from overfitting due to excessive distillation, we add an additional loss to the knowledge distillation process, using L2 regularization to improve generalization, and introduce a self-imitation mechanism to accelerate learning on current experiences. The results of several experiments on DMC and Atari 100k suggest that the proposed method can eliminate primacy bias in reinforcement learning methods, and that the distilled policy enables agents to reach higher scores more quickly.
AB - Excessive invalid explorations at the beginning of training put the deep reinforcement learning process at risk of overfitting, further resulting in spurious decisions that obstruct agents in subsequent states and explorations. This phenomenon is termed primacy bias in online reinforcement learning. This work systematically investigates primacy bias in online reinforcement learning, discussing its causes and analyzing its characteristics. In addition, to learn a policy that generalizes to subsequent states and explorations, we develop an online reinforcement learning framework, termed self-distillation reinforcement learning (SDRL), based on knowledge distillation: the agent transfers its learned knowledge into a randomly initialized policy at regular intervals, and the new policy network replaces the original one in subsequent training. The core idea of this work is that distilling knowledge from the trained policy into another policy can filter out biases, producing a more generalized policy during learning. Moreover, to prevent the new policy from overfitting due to excessive distillation, we add an additional loss to the knowledge distillation process, using L2 regularization to improve generalization, and introduce a self-imitation mechanism to accelerate learning on current experiences. The results of several experiments on DMC and Atari 100k suggest that the proposed method can eliminate primacy bias in reinforcement learning methods, and that the distilled policy enables agents to reach higher scores more quickly.
KW - Online reinforcement learning
KW - Reinforcement learning
KW - Representation learning
KW - Research and development
KW - Task analysis
KW - Training
KW - Video games
KW - Visualization
KW - overfitting
KW - reinforcement learning
UR - http://www.scopus.com/inward/record.url?scp=85193522543&partnerID=8YFLogxK
U2 - 10.1109/TNNLS.2024.3397704
DO - 10.1109/TNNLS.2024.3397704
M3 - Article
AN - SCOPUS:85193522543
SN - 2162-237X
SP - 1
EP - 13
JO - IEEE Transactions on Neural Networks and Learning Systems
JF - IEEE Transactions on Neural Networks and Learning Systems
ER -