Boosting Knowledge Distillation via Intra-Class Logit Distribution Smoothing

Research output: Contribution to journal › Article › peer-review

12 Scopus citations

Abstract

Previous works have established an intimate link between knowledge distillation (KD) and label smoothing (LS): both impose regularization on model training. In this paper, we delve deeper into the hidden reason why KD and LS exert distinct effects on a model's potential in sequential knowledge transfer. Specifically, we observe that a distilled model typically exhibits much higher intra-class variance than a regularized one and consequently acts as the better teacher. We then devise two exploratory experiments and identify that sufficient intra-class variance retained by a teacher model is an implicit distillation recipe for achieving competitive student performance. These observations allow us to put forth a simple yet beneficial approach that promotes intra-class diversity during the optimization of teacher models, yielding the most promising KD performance. Extensive experiments are conducted on various image classification tasks across three distillation paradigms, demonstrating the effectiveness and generalization of the proposed method. Additionally, we offer new interpretations that provide a deeper understanding of the gap issues, i.e., better teacher, worse student, and the success of multi-generation self-distillation. Code will be made available at https://github.com/swift1988.
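The abstract does not spell out the exact regularizer, but as a rough illustration of the idea of promoting intra-class logit diversity while training a teacher, the sketch below measures per-class logit variance in a batch and folds it into the teacher's loss. This is a minimal PyTorch sketch under stated assumptions: the function names, the variance-based penalty, and the weight `beta` are illustrative choices, not the authors' released implementation.

```python
# Minimal sketch (not the authors' code): encourage intra-class logit diversity
# when training a teacher model. The variance-based term and its weight `beta`
# are assumptions made for illustration.
import torch
import torch.nn.functional as F


def intra_class_logit_variance(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Mean variance of logit vectors within each class present in the batch."""
    variances = []
    for c in labels.unique():
        class_logits = logits[labels == c]  # shape: (n_c, num_classes)
        if class_logits.size(0) > 1:
            variances.append(class_logits.var(dim=0, unbiased=False).mean())
    if not variances:
        return logits.new_zeros(())
    return torch.stack(variances).mean()


def teacher_loss(logits: torch.Tensor, labels: torch.Tensor, beta: float = 0.1) -> torch.Tensor:
    """Cross-entropy plus a term that rewards intra-class logit diversity."""
    ce = F.cross_entropy(logits, labels)
    diversity = intra_class_logit_variance(logits, labels)
    # Subtracting the variance term nudges the teacher to keep the intra-class
    # logit distribution spread out rather than collapsed to a single point.
    return ce - beta * diversity
```

In use, `teacher_loss` would simply replace the plain cross-entropy loss in the teacher's training loop; the resulting teacher is then distilled into the student as usual.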

Original language: English
Pages (from-to): 4190-4201
Number of pages: 12
Journal: IEEE Transactions on Circuits and Systems for Video Technology
Volume: 34
Issue number: 6
DOIs
State: Published - 1 Jun 2024

Keywords

  • Knowledge distillation
  • data diversity
  • image classification
  • model compression
