Dual Modality Prompt Tuning for Vision-Language Pre-Trained Model

Yinghui Xing, Qirui Wu, De Cheng, Shizhou Zhang, Guoqiang Liang, Peng Wang, Yanning Zhang

Research output: Contribution to journal › Article › peer-review


Abstract

With the emergence of large pretrained vision-language models such as CLIP, transferable representations can be adapted to a wide range of downstream tasks via prompt tuning. Prompt tuning probes the general knowledge stored in the pretrained model for information that benefits the downstream task. A recently proposed method, Context Optimization (CoOp), introduces a set of learnable vectors as text prompts on the language side. However, tuning the text prompt alone can only adjust the synthesized 'classifier', while the visual features computed by the image encoder remain unaffected, leading to suboptimal solutions. In this article, we propose a novel dual-modality prompt tuning (DPT) paradigm that learns text and visual prompts simultaneously. To make the final image feature concentrate more on the target visual concept, a class-aware visual prompt tuning (CAVPT) scheme is further proposed in our DPT. In this scheme, the class-aware visual prompt is generated dynamically by performing cross attention between text prompt features and image patch token embeddings, encoding both downstream task-related information and visual instance information. Extensive experimental results on 11 datasets demonstrate the effectiveness and generalization ability of the proposed method.
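The abstract describes the class-aware visual prompts as the output of a cross attention in which text prompt features query the image patch token embeddings. Below is a minimal PyTorch sketch of that idea, based only on the abstract; the module name `CAVPTGenerator`, the dimensions, and the final projection layer are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class CAVPTGenerator(nn.Module):
    """Illustrative sketch of class-aware visual prompt generation.

    Text prompt features (one per class) serve as queries; image patch
    token embeddings serve as keys/values, so the generated prompts mix
    task-related (class) and instance-specific (image) information.
    Names and layer choices are hypothetical, not the paper's code.
    """

    def __init__(self, embed_dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, text_prompt_feats: torch.Tensor, patch_tokens: torch.Tensor) -> torch.Tensor:
        # text_prompt_feats: (batch, num_classes, embed_dim) -- queries
        # patch_tokens:      (batch, num_patches, embed_dim) -- keys and values
        attn_out, _ = self.cross_attn(text_prompt_feats, patch_tokens, patch_tokens)
        return self.proj(attn_out)  # (batch, num_classes, embed_dim)


# Usage sketch: the generated prompts would be concatenated with the patch
# tokens (and any learnable visual prompts) before the image-encoder layers.
generator = CAVPTGenerator(embed_dim=512, num_heads=8)
text_feats = torch.randn(2, 10, 512)   # e.g. 10 classes
patches = torch.randn(2, 196, 512)     # e.g. 14x14 ViT patch tokens
visual_prompts = generator(text_feats, patches)
print(visual_prompts.shape)            # torch.Size([2, 10, 512])
```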

Original language: English
Pages (from-to): 2056-2068
Number of pages: 13
Journal: IEEE Transactions on Multimedia
Volume: 26
DOIs
State: Published - 2024

Keywords

  • Few-shot learning
  • image classification
  • prompt tuning
  • transfer learning
  • vision-language model
