TY - GEN
T1 - Exploring the Adaptation Strategy of CLIP for Few-Shot Action Recognition
AU - Cao, Congqi
AU - Zhang, Yueran
AU - Lv, Qinyi
AU - Min, Lingtong
AU - Zhang, Yanning
N1 - Publisher Copyright:
© 2024 Copyright held by the owner/author(s).
PY - 2024/10/28
Y1 - 2024/10/28
AB - Most efforts to improve few-shot action recognition focus on designing sophisticated temporal alignment algorithms. However, these works all rely heavily on the prior knowledge within a pre-trained model. Recently, CLIP (Contrastive Language-Image Pre-Training) has shown significant few-shot learning capability in various downstream tasks. Existing works fine-tune CLIP directly on the novel classes without considering the potential utilization of the adequately labeled base class data. In this work, we conduct a thorough exploration of adaptation strategies of CLIP for few-shot action recognition. Our findings reveal that, even with a large-scale pre-trained model such as CLIP, it remains necessary to fine-tune on sufficient base class data, when available, rather than fine-tuning directly on the novel classes. Moreover, we compare two classical adaptation algorithms proposed to address few-shot learning: meta-learning and fine-tuning. Our results indicate that meta-learning is the better method for eliciting the generalization potential of CLIP. Additionally, we propose to use an overlooked, simple yet efficient fine-tuning method: partial fine-tuning, which fine-tunes only the last layer of the backbone. It requires fewer learnable parameters and less computational cost than full fine-tuning or fine-tuning with additionally introduced adapter modules. Extensive experiments on the HMDB51, UCF101, and Kinetics datasets consistently demonstrate the superior generalization ability of our method, which achieves new state-of-the-art results in few-shot action recognition.
KW - CLIP
KW - action recognition
KW - few-shot learning
KW - transfer learning
UR - http://www.scopus.com/inward/record.url?scp=85210804175&partnerID=8YFLogxK
U2 - 10.1145/3688863.3689571
DO - 10.1145/3688863.3689571
M3 - Conference contribution
AN - SCOPUS:85210804175
T3 - EMCLR 2024 - Proceedings of the 1st International Workshop on Efficient Multimedia Computing under Limited Resources, Co-Located with: MM 2024
SP - 39
EP - 48
BT - EMCLR 2024 - Proceedings of the 1st International Workshop on Efficient Multimedia Computing under Limited Resources, Co-Located with: MM 2024
PB - Association for Computing Machinery, Inc
T2 - 1st International Workshop on Efficient Multimedia Computing under Limited Resources, EMCLR 2024
Y2 - 28 October 2024 through 1 November 2024
ER -