Keshigeyan Chandrasegaran / Ngoc‑Trung Tran / Yunqing Zhao / Ngai‑Man Cheung
Singapore University of Technology and Design (SUTD)
ICML 2022
Project | ICML Paper | Pre-trained Models
This work investigates the compatibility between label smoothing (LS) and knowledge distillation (KD). Contemporary findings addressing this question take dichotomous standpoints: Müller et al. (2019) and Shen et al. (2021). Critically, there has been no effort to understand and resolve these contradictory findings, leaving the primal question unanswered: to smooth or not to smooth a teacher network? The main contributions of our work are the discovery, analysis and validation of systematic diffusion as the missing concept that is instrumental in understanding and resolving these contradictory findings. This systematic diffusion essentially curtails the benefits of distilling from an LS-trained teacher, rendering KD at increased temperatures ineffective. Our discovery is comprehensively supported by large-scale experiments, analyses and case studies, including image classification, neural machine translation and compact student distillation tasks spanning multiple datasets and teacher-student architectures. Based on our analysis, we suggest that practitioners use an LS-trained teacher with a low-temperature transfer to achieve high-performance students.
A rule of thumb for practitioners. We suggest using an LS-trained teacher with a low-temperature transfer (i.e., T = 1) to render high-performance students.
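For concreteness, here is a minimal PyTorch sketch of the two losses behind this rule of thumb (illustrative only; the function and argument names `label_smoothing_ce`, `kd_loss`, `alpha`, `lam` and `T` are ours, not this repo's API): the teacher is trained with label-smoothed cross-entropy, and the student distills from it with a temperature-T softened KL term, where T = 1 is the recommended low-temperature transfer.

```python
import torch
import torch.nn.functional as F

def label_smoothing_ce(logits, targets, alpha=0.1):
    """Teacher loss: cross-entropy against smoothed targets. The true class gets
    probability 1 - alpha and the remaining alpha is spread uniformly over all classes."""
    num_classes = logits.size(1)
    log_probs = F.log_softmax(logits, dim=1)
    smooth = torch.full_like(log_probs, alpha / num_classes)
    smooth.scatter_(1, targets.unsqueeze(1), 1.0 - alpha + alpha / num_classes)
    return -(smooth * log_probs).sum(dim=1).mean()

def kd_loss(student_logits, teacher_logits, targets, T=1.0, lam=0.5):
    """Student loss: hard-label cross-entropy plus temperature-scaled KL divergence
    to the (LS-trained) teacher. T = 1 is the low-temperature transfer recommended above."""
    soft_targets = F.softmax(teacher_logits / T, dim=1)
    log_probs = F.log_softmax(student_logits / T, dim=1)
    # The KL term is scaled by T^2 to keep gradient magnitudes comparable across temperatures.
    kd = F.kl_div(log_probs, soft_targets, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, targets)
    return lam * kd + (1.0 - lam) * ce
```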
This codebase is written in PyTorch. It is clearly documented, with bash entry points exposing all required arguments and hyper-parameters. We also provide Docker container details to run our code.
✅ PyTorch
✅ NVIDIA DALI
✅ Multi-GPU / Mixed-Precision training (see the sketch after this list)
✅ DockerFile
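For reference, a minimal mixed-precision training step with `torch.cuda.amp` might look as follows. This is only a sketch: `model`, `loader`, `optimizer` and `criterion` are placeholders, and the actual training scripts under `src/image_classification` are the authoritative reference.

```python
import torch

# GradScaler handles dynamic loss scaling for fp16 gradients (PyTorch >= 1.6).
scaler = torch.cuda.amp.GradScaler()

def train_one_epoch(model, loader, optimizer, criterion, device="cuda"):
    model.train()
    for images, targets in loader:
        images, targets = images.to(device), targets.to(device)
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():   # run the forward pass in fp16 where safe
            outputs = model(images)
            loss = criterion(outputs, targets)
        scaler.scale(loss).backward()     # scale the loss to avoid fp16 gradient underflow
        scaler.step(optimizer)            # unscale gradients, then take the optimizer step
        scaler.update()                   # adjust the loss scale for the next iteration
```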
ImageNet-1K LS / KD experiments : Clear steps on how to run and reproduce our results for ImageNet-1K LS and KD (Table 2, B.3) are provided in src/image_classification/README.md. We support Multi-GPU training and mixed-precision training. We use the NVIDIA DALI library for training student networks.
Machine Translation experiments : Clear steps on how to run and reproduce our results for machine translation LS and KD (Table 5, B.2) are provided in src/neural_machine_translation/README.md. We use [1], following the exact procedure of [2].
CUB200-2011 experiments : Clear steps on how to run and reproduce our results for fine-grained image classification (CUB200) LS and KD (Table 2, B.1) are provided in src/image_classification/README.md. We support Multi-GPU training and mixed-precision training.
Compact Student Distillation : Clear steps on how to run and reproduce our results for Compact Student distillation LS and KD (Table 4, B.3) are provided in src/image_classification/README.md. We support Multi-GPU training and mixed-precision training.
Penultimate Layer Visualization : Pseudocode for the penultimate-layer visualization algorithm is provided in src/visualization/visualization_algorithm.png. Refer to src/visualization/alpha-LS-KD_imagenet_centroids.py for the penultimate-layer visualization code to reproduce all visualizations in the main paper and Supplementary (Figures 1, A.1, A.2). The code is clearly documented; a rough sketch of the projection step is given below.
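As a rough illustration of the projection step (a sketch under our own naming, not the repo's implementation in `alpha-LS-KD_imagenet_centroids.py`), the visualization projects penultimate-layer features of three chosen classes onto the 2-D plane through their class centroids:

```python
import numpy as np

def project_to_centroid_plane(features, labels, class_ids):
    """Project penultimate-layer features of three chosen classes onto the 2-D plane
    through their class centroids.
    features: (N, D) array, labels: (N,) int array, class_ids: three class indices."""
    assert len(class_ids) == 3
    # Class centroids = mean penultimate activation per class.
    centroids = np.stack([features[labels == c].mean(axis=0) for c in class_ids])
    # Orthonormal basis of the centroid plane via Gram-Schmidt on the centroid differences.
    v1 = centroids[1] - centroids[0]
    v1 = v1 / np.linalg.norm(v1)
    v2 = centroids[2] - centroids[0]
    v2 = v2 - v2.dot(v1) * v1
    v2 = v2 / np.linalg.norm(v2)
    basis = np.stack([v1, v2], axis=1)                  # shape (D, 2)
    # 2-D coordinates of the selected examples, relative to the first centroid.
    mask = np.isin(labels, class_ids)
    coords = (features[mask] - centroids[0]) @ basis    # shape (M, 2)
    return coords, labels[mask]
```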
**ResNet-50 → ResNet-18 KD** (ImageNet-1K, Top-1 / Top-5 accuracy %)

| Model | Temperature | Teacher w/o LS | Teacher w/ LS |
|---|---|---|---|
| Teacher : ResNet-50 | - | 76.132 / 92.862 | 76.200 / 93.082 |
| Student : ResNet-18 | T = 1 | 71.488 / 90.272 | 71.666 / 90.364 |
| Student : ResNet-18 | T = 2 | 71.360 / 90.362 | 68.860 / 89.352 |
| Student : ResNet-18 | T = 3 | 69.674 / 89.698 | 67.752 / 88.932 |
| Student : ResNet-18 | T = 64 | 66.194 / 88.706 | 64.362 / 87.698 |
**ResNet-50 → ResNet-50 KD** (ImageNet-1K, Top-1 / Top-5 accuracy %)

| Model | Temperature | Teacher w/o LS | Teacher w/ LS |
|---|---|---|---|
| Teacher : ResNet-50 | - | 76.132 / 92.862 | 76.200 / 93.082 |
| Student : ResNet-50 | T = 1 | 76.328 / 92.996 | 76.896 / 93.236 |
| Student : ResNet-50 | T = 2 | 76.180 / 93.072 | 76.110 / 93.138 |
| Student : ResNet-50 | T = 3 | 75.488 / 92.670 | 75.790 / 93.006 |
| Student : ResNet-50 | T = 64 | 74.278 / 92.410 | 74.566 / 92.596 |
Results produced with the 20.12-py3 NVIDIA PyTorch Docker container + PyTorch LTS 1.8.2 + CUDA 11.1.
All pretrained image classification, fine-grained image classification, neural machine translation and compact student distillation models are available here.
@InProceedings{pmlr-v162-chandrasegaran22a,
author = {Chandrasegaran, Keshigeyan and Tran, Ngoc-Trung and Zhao, Yunqing and Cheung, Ngai-Man},
title = {Revisiting Label Smoothing and Knowledge Distillation Compatibility: What was Missing?},
booktitle = {Proceedings of the 39th International Conference on Machine Learning},
pages = {2890-2916},
year = {2022},
editor = {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
volume = {162},
series = {Proceedings of Machine Learning Research},
month = {17-23 Jul},
publisher = {PMLR},
}
We gratefully acknowledge the following works and libraries:
- PyTorch official ImageNet training : https://github.com/pytorch/examples/tree/main/imagenet
- DALI ImageNet Training : https://github.com/NVIDIA/DALI/blob/ce25d722bc47b8b4f3633ef008a85535db305789/docs/examples/use_cases/pytorch/resnet50/main.py
- Multilingual NMT with Knowledge Distillation on Fairseq (ICLR'19) : https://github.com/RayeRen/multilingual-kd-pytorch
- FairSeq Library : https://github.com/facebookresearch/fairseq
- Experiment Tracking with Weights and Biases : https://www.wandb.com/
Special thanks to Lingeng Foo and Timothy Liu for valuable discussion.
[1] Tan, Xu, et al. "Multilingual Neural Machine Translation with Knowledge Distillation." International Conference on Learning Representations. 2019.
[2] Shen, Zhiqiang, et al. "Is Label Smoothing Truly Incompatible with Knowledge Distillation: An Empirical Study." International Conference on Learning Representations. 2021.