TY - JOUR
T1 - Boosting Knowledge Distillation via Intra-Class Logit Distribution Smoothing
AU - Li, Cong
AU - Cheng, Gong
AU - Han, Junwei
N1 - Publisher Copyright:
© 1991-2012 IEEE.
PY - 2024/6/1
Y1 - 2024/6/1
N2 - Previous works have established an intimate link between knowledge distillation (KD) and label smoothing (LS), in that both impose regularization on model training. In this paper, we delve deeper into the hidden reason why KD and LS exert distinct effects on a model's ability in sequential knowledge transfer. Specifically, we observe that the distilled model typically exhibits much higher intra-class variance than the regularized one, consequently acting as the better teacher. We then devise two exploratory experiments and identify that sufficient intra-class variance retained by a teacher model is an implicit distillation recipe for achieving competitive student performance. These observations allow us to further put forth a simple yet beneficial approach that promotes intra-class diversity during the optimization of teacher models to accomplish the most promising KD performance. Extensive experiments are conducted on various image classification tasks across three distillation paradigms, demonstrating the effectiveness and generalization of our proposed method. Additionally, we offer new interpretations that provide a more in-depth understanding of the gap issues, i.e., better teacher, worse student, and the success of multi-generation self-distillation, respectively. Code will be made available at https://github.com/swift1988.
AB - Previous works have established an intimate link between knowledge distillation (KD) and label smoothing (LS), in that both impose regularization on model training. In this paper, we delve deeper into the hidden reason why KD and LS exert distinct effects on a model's ability in sequential knowledge transfer. Specifically, we observe that the distilled model typically exhibits much higher intra-class variance than the regularized one, consequently acting as the better teacher. We then devise two exploratory experiments and identify that sufficient intra-class variance retained by a teacher model is an implicit distillation recipe for achieving competitive student performance. These observations allow us to further put forth a simple yet beneficial approach that promotes intra-class diversity during the optimization of teacher models to accomplish the most promising KD performance. Extensive experiments are conducted on various image classification tasks across three distillation paradigms, demonstrating the effectiveness and generalization of our proposed method. Additionally, we offer new interpretations that provide a more in-depth understanding of the gap issues, i.e., better teacher, worse student, and the success of multi-generation self-distillation, respectively. Code will be made available at https://github.com/swift1988.
KW - Knowledge distillation
KW - data diversity
KW - image classification
KW - model compression
UR - http://www.scopus.com/inward/record.url?scp=85176305960&partnerID=8YFLogxK
U2 - 10.1109/TCSVT.2023.3327113
DO - 10.1109/TCSVT.2023.3327113
M3 - Article
AN - SCOPUS:85176305960
SN - 1051-8215
VL - 34
SP - 4190
EP - 4201
JO - IEEE Transactions on Circuits and Systems for Video Technology
JF - IEEE Transactions on Circuits and Systems for Video Technology
IS - 6
ER -