
CLIP2RS: Leveraging Pretrained Vision-Language Model for Semantic Segmentation of Remote Sensing Images

  • Yinghui Xing
  • Dexuan Kong
  • Shizhou Zhang
  • Ziyi Li
  • Qingyi Li
  • Yanning Zhang

Northwestern Polytechnical University, Xi'an

Research output: Contribution to journal › Article › peer-review

Abstract

Semantic segmentation of remote sensing (RS) images is a challenging task owing to the diversity, complexity, and sheer scale of the data. Current research predominantly exploits visual context alone through meticulous architecture design, often overlooking important semantic information; this oversight limits the ability to handle intraclass variation. In this article, we propose CLIP2RS, which leverages a pretrained vision-language model (VLM) for semantic segmentation of RS images under the guidance of the prior knowledge stored in the pretrained foundation model. Specifically, CLIP2RS adopts a two-stage training strategy to bridge the domain gap between natural images and RS images. A dual-granularity alignment framework that simultaneously aligns pixel-level local features and image-level global features is designed to alleviate the severe class-sample imbalance problem. In addition, a novel prompting mechanism is explored to fully harness the potential of CLIP textual descriptions. Comprehensive experiments on the iSAID, Potsdam, and Vaihingen datasets show that the proposed method achieves state-of-the-art performance.
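The abstract describes a dual-granularity alignment between visual features and CLIP text embeddings but gives no implementation details. The following is a minimal PyTorch sketch of what such an objective could look like, not the authors' published method: the function name `dual_granularity_alignment_loss`, the argument names, the temperature of 0.07, and the choice of a multi-label BCE term for the global branch are all illustrative assumptions, with CLIP-space features assumed to be precomputed.

```python
import torch
import torch.nn.functional as F

def dual_granularity_alignment_loss(
    pixel_feat,    # (B, C, H, W) dense visual features projected to CLIP dim
    global_feat,   # (B, C) pooled image-level features
    text_emb,      # (K, C) CLIP text embeddings, one per class prompt
    pixel_labels,  # (B, H, W) long tensor of class indices, 255 = ignore
    temperature=0.07,
):
    """Hypothetical sketch of a dual-granularity alignment objective:
    pixel-level (local) and image-level (global) features are both
    aligned with CLIP text embeddings of the class names."""
    # L2-normalize so dot products are cosine similarities, as in CLIP.
    pixel_feat = F.normalize(pixel_feat, dim=1)
    global_feat = F.normalize(global_feat, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pixel-level alignment: per-pixel logits over the K class texts.
    # einsum: (B,C,H,W) x (K,C) -> (B,K,H,W)
    pixel_logits = torch.einsum("bchw,kc->bkhw", pixel_feat, text_emb) / temperature
    local_loss = F.cross_entropy(pixel_logits, pixel_labels, ignore_index=255)

    # Image-level alignment: a multi-label objective over the classes
    # that actually appear in each image (an assumed design choice).
    global_logits = global_feat @ text_emb.t() / temperature  # (B, K)
    present = torch.zeros_like(global_logits)
    for b in range(pixel_labels.size(0)):
        cls = pixel_labels[b].unique()
        present[b, cls[cls < text_emb.size(0)]] = 1.0  # drop the ignore index
    global_loss = F.binary_cross_entropy_with_logits(global_logits, present)

    return local_loss + global_loss

# Example shapes: B=2 images, C=512 CLIP dim, K=16 classes, 64x64 feature map.
loss = dual_granularity_alignment_loss(
    torch.randn(2, 512, 64, 64),
    torch.randn(2, 512),
    torch.randn(16, 512),
    torch.randint(0, 16, (2, 64, 64)),
)
```

Under this reading, the local term pushes every labeled pixel toward its class text embedding, while the global term encourages the pooled image feature to stay close to the embeddings of all classes present in the scene, which can give rare classes a gradient signal even when they occupy few pixels.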

Original language: English
Article number: 5604116
Journal: IEEE Transactions on Geoscience and Remote Sensing
Volume: 64
DOI
Publication status: Published - 2026
