TY - GEN
T1 - Task-Adapter: Task-specific Adaptation of Image Models for Few-shot Action Recognition
T2 - 32nd ACM International Conference on Multimedia, MM 2024
AU - Cao, Congqi
AU - Zhang, Yueran
AU - Yu, Yating
AU - Lv, Qinyi
AU - Min, Lingtong
AU - Zhang, Yanning
N1 - Publisher Copyright:
© 2024 ACM.
PY - 2024/10/28
Y1 - 2024/10/28
AB - Existing works in few-shot action recognition mostly fine-tune a pre-trained image model and design sophisticated temporal alignment modules at the feature level. However, simply fully fine-tuning the pre-trained model could cause overfitting due to the scarcity of video samples. Additionally, we argue that the exploration of task-specific information is insufficient when relying solely on well-extracted abstract features. In this work, we propose a simple but effective task-specific adaptation method (Task-Adapter) for few-shot action recognition. By introducing the proposed Task-Adapter into the last several layers of the backbone and keeping the parameters of the original pre-trained model frozen, we mitigate the overfitting problem caused by full fine-tuning and advance the task-specific mechanism into the process of feature extraction. In each Task-Adapter, we reuse the frozen self-attention layer to perform task-specific self-attention across different videos within the given task to capture both distinctive information among classes and shared information within classes, which facilitates task-specific adaptation and enhances subsequent metric measurement between the query feature and support prototypes. Experimental results consistently demonstrate the effectiveness of our proposed Task-Adapter on four standard few-shot action recognition datasets. Especially on the temporally challenging SSv2 dataset, our method outperforms the state-of-the-art methods by a large margin.
KW - few-shot action recognition
KW - parameter-efficient fine-tuning
KW - task-specific adaptation
UR - http://www.scopus.com/inward/record.url?scp=85209777334&partnerID=8YFLogxK
U2 - 10.1145/3664647.3681081
DO - 10.1145/3664647.3681081
M3 - Conference contribution
AN - SCOPUS:85209777334
T3 - MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia
SP - 9038
EP - 9047
BT - MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia
PB - Association for Computing Machinery, Inc
Y2 - 28 October 2024 through 1 November 2024
ER -