TY - JOUR
T1 - Boosting Knowledge Distillation via Intra-Class Logit Distribution Smoothing
AU - Li, Cong
AU - Cheng, Gong
AU - Han, Junwei
N1 - Publisher Copyright:
© 1991-2012 IEEE.
PY - 2024/6/1
Y1 - 2024/6/1
N2 - Previous works have established an intimate link between knowledge distillation (KD) and label smoothing (LS), in that both impose regularization on model training. In this paper, we delve deeper into the hidden reason why KD and LS exert distinct effects on a model's ability in sequential knowledge transfer. Specifically, we observe that the distilled model typically exhibits much higher intra-class variance than the regularized one, consequently acting as the better teacher. We then devise two exploratory experiments and identify that sufficient intra-class variance retained by a teacher model is an implicit distillation recipe for achieving competitive student performance. These observations allow us to further put forth a simple yet beneficial approach that promotes intra-class diversity during the optimization of teacher models to accomplish the most promising KD performance. Extensive experiments are conducted on various image classification tasks across three distillation paradigms, demonstrating the effectiveness and generalization of our proposed method. Additionally, we offer new interpretations that provide a more in-depth understanding of the gap issues, i.e., better teacher, worse student, and the success of multi-generation self-distillation, respectively. Code will be made available at https://github.com/swift1988.
AB - Previous works have established an intimate link between knowledge distillation (KD) and label smoothing (LS), in that both impose regularization on model training. In this paper, we delve deeper into the hidden reason why KD and LS exert distinct effects on a model's ability in sequential knowledge transfer. Specifically, we observe that the distilled model typically exhibits much higher intra-class variance than the regularized one, consequently acting as the better teacher. We then devise two exploratory experiments and identify that sufficient intra-class variance retained by a teacher model is an implicit distillation recipe for achieving competitive student performance. These observations allow us to further put forth a simple yet beneficial approach that promotes intra-class diversity during the optimization of teacher models to accomplish the most promising KD performance. Extensive experiments are conducted on various image classification tasks across three distillation paradigms, demonstrating the effectiveness and generalization of our proposed method. Additionally, we offer new interpretations that provide a more in-depth understanding of the gap issues, i.e., better teacher, worse student, and the success of multi-generation self-distillation, respectively. Code will be made available at https://github.com/swift1988.
KW - Knowledge distillation
KW - data diversity
KW - image classification
KW - model compression
UR - http://www.scopus.com/inward/record.url?scp=85176305960&partnerID=8YFLogxK
U2 - 10.1109/TCSVT.2023.3327113
DO - 10.1109/TCSVT.2023.3327113
M3 - Article
AN - SCOPUS:85176305960
SN - 1051-8215
VL - 34
SP - 4190
EP - 4201
JO - IEEE Transactions on Circuits and Systems for Video Technology
JF - IEEE Transactions on Circuits and Systems for Video Technology
IS - 6
ER -