TY - JOUR
T1 - CADS: A Self-Supervised Learner via Cross-Modal Alignment and Deep Self-Distillation for CT Volume Segmentation
T2 - IEEE Transactions on Medical Imaging
AU - Ye, Yiwen
AU - Zhang, Jianpeng
AU - Chen, Ziyang
AU - Xia, Yong
N1 - Publisher Copyright:
© 1982-2012 IEEE.
PY - 2025
Y1 - 2025
AB - Self-supervised learning (SSL) has achieved great success in advancing annotation-efficient learning. However, when applied to CT volume segmentation, most SSL methods suffer from two limitations: they rarely use the information acquired by different imaging modalities, and they provide supervision only to the bottleneck layer of the encoder. To address both limitations, we design a pretext task that aligns the information in each 3D CT volume with that of the corresponding generated 2D X-ray image, and we extend self-distillation to deep self-distillation. On this basis, we propose a self-supervised learner based on Cross-modal Alignment and Deep Self-distillation (CADS) to improve the encoder's ability to characterize CT volumes. Cross-modal alignment is a more challenging pretext task that forces the encoder to learn stronger image representations. Deep self-distillation provides supervision not only to the bottleneck layer but also to shallower layers, boosting the representation ability of both. Comparative experiments show that, during pre-training, CADS has lower computational complexity and GPU memory cost than competing SSL methods. Based on the pre-trained encoder, we construct PVT-UNet for 3D CT volume segmentation. Our results on seven downstream tasks indicate that PVT-UNet outperforms state-of-the-art SSL methods such as MOCOv3 and DiRA, as well as prevalent medical image segmentation methods such as nnUNet and CoTr. Code and pre-trained weights will be available at https://github.com/yeerwen/CADS.
KW - CT volume segmentation
KW - Self-supervised learning
KW - cross-modal alignment
KW - deep self-distillation
UR - http://www.scopus.com/inward/record.url?scp=85199353989&partnerID=8YFLogxK
U2 - 10.1109/TMI.2024.3431916
DO - 10.1109/TMI.2024.3431916
M3 - Article
AN - SCOPUS:85199353989
SN - 0278-0062
VL - 44
SP - 118
EP - 129
JO - IEEE Transactions on Medical Imaging
JF - IEEE Transactions on Medical Imaging
IS - 1
ER -