TY - JOUR
T1 - Implicit CLIP Prior Decoupling for Few-Shot Remote Sensing Image Segmentation
AU - Jiang, Zhiyu
AU - Yuan, Ye
AU - Ma, Dandan
AU - Wang, Qi
AU - Yuan, Yuan
N1 - Publisher Copyright:
© 1980-2012 IEEE.
PY - 2025
Y1 - 2025
N2 - Few-shot segmentation (FSS) in remote sensing aims to achieve segmentation of novel categories in query images using limited annotated support images. Despite extensive research, the significant intraclass differences of remote sensing targets continue to hinder progress in this field. Pretrained vision–language models (VLMs) possess strong generalization capabilities, and their cross-modal information can effectively mitigate intraclass variance issues. However, VLMs rarely focus on dense prediction tasks, and the complexity of remote sensing imagery limits the effectiveness of existing attempts on FSS tasks. To address this issue, this article proposes an implicit contrastive language-image pretraining (CLIP) prior decoupling network (ICPD-Net), which mines effective cross-modal priors from VLMs and leverages ranking information to improve visual metric strategies. Specifically, the implicit prior decoupling module (IPDM) utilizes ambiguous foreground–background vision–language similarities to construct class-agnostic prompts, while employing a prior learner to mine implicit vision–language priors that alleviate intraclass differences. To fully leverage cross-modal information, the reliable feature fusion module (RFFM) utilizes vision–language priors to obtain high-confidence query features for fusion with support features and further mitigates intraclass differences through a self-support paradigm. Finally, the dual visual priors module (DVPM) introduces a novel rank information prior for visual feature measurement. This approach constructs an effective metric learning method by combining the ranking relationships of Euclidean distances between support–query features with the normalized discounted cumulative gain (NDCG) algorithm, while comprehensively exploring visual metric relationships through a traditional cosine similarity prior. Extensive experiments on iSAID-5i and DLRSD-5i demonstrate that our method achieves significant improvements. Particularly under the one-shot setting, our approach shows exceptional effectiveness, outperforming state-of-the-art methods by up to 11.48%.
AB - Few-shot segmentation (FSS) in remote sensing aims to achieve segmentation of novel categories in query images using limited annotated support images. Despite extensive research, the significant intraclass differences of remote sensing targets continue to hinder progress in this field. Pretrained vision–language models (VLMs) possess strong generalization capabilities, and their cross-modal information can effectively mitigate intraclass variance issues. However, VLMs rarely focus on dense prediction tasks, and the complexity of remote sensing imagery limits the effectiveness of existing attempts on FSS tasks. To address this issue, this article proposes an implicit contrastive language-image pretraining (CLIP) prior decoupling network (ICPD-Net), which mines effective cross-modal priors from VLMs and leverages ranking information to improve visual metric strategies. Specifically, the implicit prior decoupling module (IPDM) utilizes ambiguous foreground–background vision–language similarities to construct class-agnostic prompts, while employing a prior learner to mine implicit vision–language priors that alleviate intraclass differences. To fully leverage cross-modal information, the reliable feature fusion module (RFFM) utilizes vision–language priors to obtain high-confidence query features for fusion with support features and further mitigates intraclass differences through a self-support paradigm. Finally, the dual visual priors module (DVPM) introduces a novel rank information prior for visual feature measurement. This approach constructs an effective metric learning method by combining the ranking relationships of Euclidean distances between support–query features with the normalized discounted cumulative gain (NDCG) algorithm, while comprehensively exploring visual metric relationships through a traditional cosine similarity prior. Extensive experiments on iSAID-5i and DLRSD-5i demonstrate that our method achieves significant improvements. Particularly under the one-shot setting, our approach shows exceptional effectiveness, outperforming state-of-the-art methods by up to 11.48%.
KW - Cross-modal learning
KW - few-shot segmentation (FSS)
KW - metric learning
KW - remote sensing
UR - https://www.scopus.com/pages/publications/105018108971
U2 - 10.1109/TGRS.2025.3617662
DO - 10.1109/TGRS.2025.3617662
M3 - Article
AN - SCOPUS:105018108971
SN - 0196-2892
VL - 63
JO - IEEE Transactions on Geoscience and Remote Sensing
JF - IEEE Transactions on Geoscience and Remote Sensing
M1 - 5646813
ER -