CLIP2RS: Leveraging Pretrained Vision-Language Model for Semantic Segmentation of Remote Sensing Images

Abstract
Semantic segmentation of Remote Sensing (RS) images is a challenging task due to their complicated characteristics, such as diversity, complexity, and massive scale. Current research efforts predominantly center on exploiting visual context information through meticulous architecture design, often overlooking significant semantic details; this oversight limits their efficacy in tackling intra-class variations. In this paper, we propose CLIP2RS, which leverages a pretrained Vision-Language Model (VLM) for semantic segmentation of RS images under the guidance of the prior knowledge stored in the pretrained foundation model. Specifically, CLIP2RS adopts a two-stage training strategy to overcome the domain gap between natural images and remote sensing images. A dual-granularity alignment framework that simultaneously aligns pixel-level local features and image-level global features is designed to alleviate the severe class imbalance problem. Additionally, a novel prompting mechanism is explored to fully harness the potential of CLIP textual descriptions. We conduct comprehensive experiments on the iSAID, Potsdam, and Vaihingen datasets, and the results show that our method achieves state-of-the-art performance, demonstrating its superiority.
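To make the dual-granularity alignment idea concrete, the sketch below shows one plausible way to align both per-pixel (local) and pooled (global) visual features against CLIP text embeddings of the class prompts. This is an illustrative sketch only, not the authors' implementation: the function name `dual_granularity_logits`, the mean-pooling choice for the global feature, and the temperature value are assumptions, and `text_embeds` is assumed to come from CLIP's text encoder applied to per-class prompts (e.g. produced by the paper's prompting mechanism).

```python
# Illustrative sketch (not the authors' code) of dual-granularity alignment:
# dense pixel features and a pooled global feature are both scored against
# CLIP text embeddings of the class prompts via cosine similarity.
import torch
import torch.nn.functional as F

def dual_granularity_logits(pixel_feats, text_embeds, tau=0.07):
    """pixel_feats: (B, C, H, W) dense features from the image encoder.
    text_embeds: (K, C) CLIP text embeddings, one per class prompt.
    Returns pixel-level logits (B, H, W, K) and image-level logits (B, K)."""
    B, C, H, W = pixel_feats.shape
    # L2-normalize both modalities so dot products are cosine similarities.
    pix = F.normalize(pixel_feats.flatten(2).transpose(1, 2), dim=-1)  # (B, HW, C)
    txt = F.normalize(text_embeds, dim=-1)                             # (K, C)
    # Pixel-level (local) alignment: per-pixel similarity to every class text.
    local_logits = pix @ txt.t() / tau                                 # (B, HW, K)
    # Image-level (global) alignment: pool the dense features, then align.
    glob = F.normalize(pix.mean(dim=1), dim=-1)                        # (B, C)
    global_logits = glob @ txt.t() / tau                               # (B, K)
    return local_logits.view(B, H, W, -1), global_logits
```

In a setup like this, the local logits would feed a per-pixel segmentation loss while the global logits would feed an image-level (e.g. multi-label) loss, so rare classes still receive supervision at the image level even when they cover few pixels, which is one way such dual alignment can mitigate class imbalance.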
| Original language | English |
|---|---|
| Journal | IEEE Transactions on Geoscience and Remote Sensing |
| State | Accepted/In press - 2025 |
Keywords
- Remote Sensing
- Semantic Segmentation
- Vision-Language Model