TY - JOUR
T1 - CLIP2RS: Leveraging Pretrained Vision-Language Model for Semantic Segmentation of Remote Sensing Images
T2 - IEEE Transactions on Geoscience and Remote Sensing
AU - Xing, Yinghui
AU - Kong, Dexuan
AU - Zhang, Shizhou
AU - Li, Ziyi
AU - Li, Qingyi
AU - Zhang, Yanning
N1 - Publisher Copyright:
© 1980-2012 IEEE.
PY - 2026
Y1 - 2026
N2 - Semantic segmentation of remote sensing (RS) images is a challenging task because of the images' diversity, complexity, and massive scale. Current research focuses predominantly on exploiting visual context information through meticulous architecture design, often overlooking valuable semantic information, which limits its efficacy in handling intraclass variation. In this article, we propose CLIP2RS, which leverages a pretrained vision-language model (VLM) for the semantic segmentation of RS images, guided by the prior knowledge stored in the pretrained foundation model. Specifically, CLIP2RS uses a two-stage training strategy to bridge the domain gap between natural and RS images. A dual-granularity alignment framework that simultaneously aligns pixel-level local features and image-level global features is designed to alleviate severe class-sample imbalance. In addition, a novel prompting mechanism is explored to fully harness the potential of CLIP textual descriptions. Comprehensive experiments on the iSAID, Potsdam, and Vaihingen datasets show that the proposed method achieves state-of-the-art performance.
AB - Semantic segmentation of remote sensing (RS) images is a challenging task because of the images' diversity, complexity, and massive scale. Current research focuses predominantly on exploiting visual context information through meticulous architecture design, often overlooking valuable semantic information, which limits its efficacy in handling intraclass variation. In this article, we propose CLIP2RS, which leverages a pretrained vision-language model (VLM) for the semantic segmentation of RS images, guided by the prior knowledge stored in the pretrained foundation model. Specifically, CLIP2RS uses a two-stage training strategy to bridge the domain gap between natural and RS images. A dual-granularity alignment framework that simultaneously aligns pixel-level local features and image-level global features is designed to alleviate severe class-sample imbalance. In addition, a novel prompting mechanism is explored to fully harness the potential of CLIP textual descriptions. Comprehensive experiments on the iSAID, Potsdam, and Vaihingen datasets show that the proposed method achieves state-of-the-art performance.
KW - Remote sensing (RS)
KW - semantic segmentation
KW - vision-language model (VLM)
UR - https://www.scopus.com/pages/publications/105025730317
U2 - 10.1109/TGRS.2025.3647015
DO - 10.1109/TGRS.2025.3647015
M3 - Article
AN - SCOPUS:105025730317
SN - 0196-2892
VL - 64
JO - IEEE Transactions on Geoscience and Remote Sensing
JF - IEEE Transactions on Geoscience and Remote Sensing
M1 - 5604116
ER -