WeakCLIP: Adapting CLIP for Weakly-Supervised Semantic Segmentation

Lianghui Zhu, Xinggang Wang, Jiapei Feng, Tianheng Cheng, Yingyue Li, Bo Jiang, Dingwen Zhang, Junwei Han

Research output: Contribution to journal › Article › peer-review

2 Citations (Scopus)

Abstract

Contrastive language-image pre-training (CLIP) achieves great success in various computer vision tasks and also presents an opportune avenue for enhancing weakly-supervised image understanding with its large-scale pre-trained knowledge. As an effective way to reduce reliance on pixel-level human-annotated labels, weakly-supervised semantic segmentation (WSSS) aims to refine the class activation map (CAM) into high-quality pseudo masks, but it heavily relies on inductive biases such as hand-crafted priors and digital image processing methods. For the vision-language pre-trained model CLIP, we propose a novel text-to-pixel matching paradigm for WSSS. However, directly applying CLIP to WSSS is challenging due to three critical problems: (1) the task gap between contrastive pre-training and WSSS CAM refinement, (2) the lack of text-to-pixel modeling to fully utilize the pre-trained knowledge, and (3) insufficient details owing to the 1/16 down-sampling resolution of ViT. Thus, we propose WeakCLIP to address these problems and transfer the pre-trained knowledge of CLIP to WSSS. Specifically, we first address the task gap by proposing a pyramid adapter and learnable prompts to extract WSSS-specific representations. We then design a co-attention matching module to model text-to-pixel relationships. Finally, the pyramid adapter and a text-guided decoder are introduced to gather multi-level information and integrate it with text guidance hierarchically. WeakCLIP provides an effective and parameter-efficient way to transfer CLIP knowledge for CAM refinement. Extensive experiments demonstrate that WeakCLIP achieves state-of-the-art WSSS performance on standard benchmarks, i.e., 74.0% mIoU on the val set of PASCAL VOC 2012 and 46.1% mIoU on the val set of COCO 2014. The source code and model checkpoints are released at https://github.com/hustvl/WeakCLIP.
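To make the text-to-pixel matching idea concrete, the sketch below shows one plausible form of a co-attention matching module: class text embeddings and flattened pixel features cross-attend to each other, and each pixel is then scored against each class embedding. This is a minimal illustration assuming generic multi-head cross-attention, an embedding dimension of 512, and the names `CoAttentionMatching`, `text_emb`, and `pixel_feat`; it is not the authors' released implementation (see the repository linked above for that).

```python
# Minimal sketch (not the released WeakCLIP code) of text-to-pixel co-attention
# matching. Shapes, layer choices, and names are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CoAttentionMatching(nn.Module):
    """Cross-attend text and pixel features in both directions, then produce a
    per-class matching map over pixels via cosine similarity."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # text queries attend to pixel tokens, and pixel queries attend to text tokens
        self.text_to_pixel = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.pixel_to_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_text = nn.LayerNorm(dim)
        self.norm_pixel = nn.LayerNorm(dim)

    def forward(self, text_emb: torch.Tensor, pixel_feat: torch.Tensor) -> torch.Tensor:
        # text_emb:   (B, C, D)   one CLIP text embedding per class prompt
        # pixel_feat: (B, H*W, D) flattened ViT patch/pixel features
        t, _ = self.text_to_pixel(text_emb, pixel_feat, pixel_feat)
        p, _ = self.pixel_to_text(pixel_feat, text_emb, text_emb)
        t = self.norm_text(text_emb + t)      # residual + norm on text side
        p = self.norm_pixel(pixel_feat + p)   # residual + norm on pixel side
        # cosine-similarity matching map: (B, C, H*W), reshapeable to (B, C, H, W)
        t = F.normalize(t, dim=-1)
        p = F.normalize(p, dim=-1)
        return torch.einsum("bcd,bnd->bcn", t, p)


if __name__ == "__main__":
    B, C, D, HW = 2, 20, 512, 24 * 24  # e.g. 20 PASCAL VOC classes, 24x24 patches
    module = CoAttentionMatching(dim=D)
    scores = module(torch.randn(B, C, D), torch.randn(B, HW, D))
    print(scores.shape)  # torch.Size([2, 20, 576])
```

In this reading, the resulting (B, C, H, W) matching map plays the role of text-guided evidence for refining the CAM; the pyramid adapter and text-guided decoder described in the abstract would supply the multi-level features feeding such a module.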

Original language: English
Pages (from-to): 1085-1105
Number of pages: 21
Journal: International Journal of Computer Vision
Volume: 133
Issue number: 3
DOI
Publication status: Published - March 2025
