TY - GEN
T1 - Boosting Multi-Modal Alignment
T2 - 33rd ACM International Conference on Multimedia, MM 2025
AU - Liang, Guoqiang
AU - Qin, Chuan
AU - Cheng, De
AU - Zhang, Shizhou
AU - Zhang, Yanning
N1 - Publisher Copyright:
© 2025 ACM.
PY - 2025/10/27
Y1 - 2025/10/27
N2 - Class Incremental Learning (CIL) aims to continually learn new classes from a stream of data without forgetting previously learned ones. Recent approaches have leveraged pre-trained models (PTMs) to improve performance, especially vision-language models, which offer better generalization than models trained solely on visual data. Many of these methods rely on simple language templates to generate class representations, which then serve as classifiers. However, due to differences between the pre-training data and downstream tasks, these textual features can become too similar for certain classes, leading to prediction errors. To address this issue, we propose a method that optimizes the geometric structure of both visual and textual features across different classes. Inspired by neural collapse theory, we introduce a multi-modal alignment strategy: for each class, a reference vector is chosen from a simplex Equiangular Tight Frame, and both the visual and textual features of the class are aligned with this vector. To better capture intra-class variations, we also construct multiple visual prototypes for each class. A multi-prototype supervised contrastive loss is then employed to pull an image feature toward the closest matching prototype of its true class and push it away from prototypes of other classes. We evaluate our approach on five widely used CIL benchmarks. The results show that our method achieves state-of-the-art performance, demonstrating its effectiveness in addressing the challenges of class incremental learning. Our code is available at https://github.com/qcNPU/NCSCMP.
AB - Class Incremental Learning (CIL) aims to continually learn new classes from a stream of data without forgetting previously learned ones. Recent approaches have leveraged pre-trained models (PTMs) to improve performance, especially vision-language models, which offer better generalization than models trained solely on visual data. Many of these methods rely on simple language templates to generate class representations, which then serve as classifiers. However, due to differences between the pre-training data and downstream tasks, these textual features can become too similar for certain classes, leading to prediction errors. To address this issue, we propose a method that optimizes the geometric structure of both visual and textual features across different classes. Inspired by neural collapse theory, we introduce a multi-modal alignment strategy: for each class, a reference vector is chosen from a simplex Equiangular Tight Frame, and both the visual and textual features of the class are aligned with this vector. To better capture intra-class variations, we also construct multiple visual prototypes for each class. A multi-prototype supervised contrastive loss is then employed to pull an image feature toward the closest matching prototype of its true class and push it away from prototypes of other classes. We evaluate our approach on five widely used CIL benchmarks. The results show that our method achieves state-of-the-art performance, demonstrating its effectiveness in addressing the challenges of class incremental learning. Our code is available at https://github.com/qcNPU/NCSCMP.
KW - class incremental learning
KW - multiple prototypes
KW - visual-language model
UR - https://www.scopus.com/pages/publications/105024074270
U2 - 10.1145/3746027.3755519
DO - 10.1145/3746027.3755519
M3 - Conference contribution
AN - SCOPUS:105024074270
T3 - MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia, Co-Located with MM 2025
SP - 1880
EP - 1889
BT - MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia, Co-Located with MM 2025
PB - Association for Computing Machinery, Inc
Y2 - 27 October 2025 through 31 October 2025
ER -