TY - GEN
T1 - Exploring the Adaptation Strategy of CLIP for Few-Shot Action Recognition
AU - Cao, Congqi
AU - Zhang, Yueran
AU - Lv, Qinyi
AU - Min, Lingtong
AU - Zhang, Yanning
N1 - Publisher Copyright:
© 2024 Copyright held by the owner/author(s).
PY - 2024/10/28
Y1 - 2024/10/28
AB - Most efforts to improve few-shot action recognition focus on designing sophisticated temporal alignment algorithms. However, these works all rely heavily on the prior knowledge within a pre-trained model. Recently, CLIP (Contrastive Language-Image Pre-Training) has shown significant few-shot learning capability in various downstream tasks. Existing works fine-tune CLIP directly on the novel classes without considering the potential utilization of the adequately labeled base class data. In this work, we conduct a thorough exploration of adaptation strategies of CLIP for few-shot action recognition. Our findings reveal that, even with a large-scale pre-trained model such as CLIP, it remains necessary to fine-tune on sufficient base class data, when available, rather than fine-tuning directly on the novel classes. Moreover, we compare two classical adaptation algorithms proposed to address few-shot learning: meta-learning and fine-tuning. Our results indicate that meta-learning is the better method for eliciting the generalization potential of CLIP. Additionally, we propose to use an overlooked, simple yet efficient fine-tuning method: partial fine-tuning, which fine-tunes only the last layer of the backbone. It requires fewer learnable parameters and less computational cost than full fine-tuning or fine-tuning with additionally introduced adapter modules. Extensive experiments on the HMDB51, UCF101, and Kinetics datasets consistently demonstrate the superior generalization ability of our method, which achieves new state-of-the-art results in few-shot action recognition.
KW - CLIP
KW - action recognition
KW - few-shot learning
KW - transfer learning
UR - http://www.scopus.com/inward/record.url?scp=85210804175&partnerID=8YFLogxK
U2 - 10.1145/3688863.3689571
DO - 10.1145/3688863.3689571
M3 - Conference contribution
AN - SCOPUS:85210804175
T3 - EMCLR 2024 - Proceedings of the 1st International Workshop on Efficient Multimedia Computing under Limited Resources, Co-Located with: MM 2024
SP - 39
EP - 48
BT - EMCLR 2024 - Proceedings of the 1st International Workshop on Efficient Multimedia Computing under Limited Resources, Co-Located with: MM 2024
PB - Association for Computing Machinery, Inc
T2 - 1st International Workshop on Efficient Multimedia Computing under Limited Resources, EMCLR 2024
Y2 - 28 October 2024 through 1 November 2024
ER -