Exploring the Adaptation Strategy of CLIP for Few-Shot Action Recognition

Congqi Cao, Yueran Zhang, Qinyi Lv, Lingtong Min, Yanning Zhang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

The majority of efforts to improve few-shot action recognition are dedicated to designing sophisticated temporal alignment algorithms. However, these works all rely heavily on the prior knowledge within a pre-trained model. Recently, CLIP (Contrastive Language-Image Pre-Training) has shown significant few-shot learning capability in various downstream tasks. Existing works fine-tune CLIP directly on the novel classes without considering the potential utilization of the adequately labeled base-class data. In this work, we conduct a thorough exploration of adaptation strategies for CLIP in few-shot action recognition. Our findings reveal that, even with a large-scale pre-trained model such as CLIP, it remains necessary to fine-tune on sufficient base-class data, if available, rather than fine-tuning directly on the novel classes. Moreover, we compare two classical adaptation algorithms proposed to address few-shot learning: meta-learning and fine-tuning. Our results indicate that meta-learning better elicits the generalization potential of CLIP. Additionally, we propose to use an overlooked, simple but efficient fine-tuning method: partial fine-tuning, which updates only the last layer of the backbone. It requires fewer learnable parameters and less computational cost than full fine-tuning or fine-tuning with additionally introduced adapter modules. Extensive experiments on the HMDB51, UCF101, and Kinetics datasets consistently demonstrate the superior generalization ability of our method, which achieves new state-of-the-art results in few-shot action recognition.
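Because the abstract describes the partial fine-tuning recipe concretely (only the last layer of the backbone is updated), a minimal PyTorch sketch of the idea follows. It is not code from the paper: it assumes OpenAI's clip package and the attribute names of its ViT implementation (model.visual.transformer.resblocks), and the authors' exact choice of which parameters constitute the "last layer" may differ.

    # Sketch of partial fine-tuning: freeze the whole CLIP model, then
    # unfreeze only the last transformer block of the image encoder.
    # Attribute names assume OpenAI's CLIP ViT; not the paper's code.
    import clip
    import torch

    model, _preprocess = clip.load("ViT-B/16", device="cpu")

    # Freeze every parameter of both encoders first.
    for p in model.parameters():
        p.requires_grad = False

    # Unfreeze only the last residual attention block of the visual backbone.
    for p in model.visual.transformer.resblocks[-1].parameters():
        p.requires_grad = True

    # Optimize just the unfrozen subset.
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=1e-5)

    n_train = sum(p.numel() for p in trainable)
    n_total = sum(p.numel() for p in model.parameters())
    print(f"trainable: {n_train:,} / {n_total:,} ({n_train / n_total:.2%})")

The printout makes the abstract's efficiency claim concrete: only a small fraction of the parameters receives gradients, so both memory and compute per update step are well below full fine-tuning.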

Original language: English
Title of host publication: EMCLR 2024 - Proceedings of the 1st International Workshop on Efficient Multimedia Computing under Limited Resources, Co-Located with
Subtitle of host publication: MM 2024
Publisher: Association for Computing Machinery, Inc
Pages: 39-48
Number of pages: 10
ISBN (Electronic): 9798400711909
State: Published - 28 Oct 2024
Event: 1st International Workshop on Efficient Multimedia Computing under Limited Resources, EMCLR 2024 - Melbourne, Australia
Duration: 28 Oct 2024 - 1 Nov 2024

Publication series

Name: EMCLR 2024 - Proceedings of the 1st International Workshop on Efficient Multimedia Computing under Limited Resources, Co-Located with: MM 2024

Conference

Conference: 1st International Workshop on Efficient Multimedia Computing under Limited Resources, EMCLR 2024
Country/Territory: Australia
City: Melbourne
Period: 28/10/24 - 1/11/24

Keywords

  • CLIP
  • action recognition
  • few-shot learning
  • transfer learning
