Abstract
Vision foundation models, such as the segment anything model (SAM), have advanced remote sensing (RS) tasks. However, extending SAM to multimodal RS semantic segmentation faces two key challenges: 1) SAM is tailored for unimodal inputs and lacks RS-specific knowledge, hindering effective spatial modeling and cross-modal feature integration; and 2) SAM depends on externally provided prompts (e.g., points or boxes), limiting its scalability and practicality in multimodal scenarios. To address these issues, we present AutoSAM, an end-to-end auto-prompting Mamba-based vision foundation model framework tailored for multimodal RS semantic segmentation. Specifically, we introduce a CrossMamba block (CMB) in the feature extraction stage to replace the conventional multihead self-attention mechanism, where the core reverse interactive scanning adaptor-SS2D module (RISASM) promotes semantic interaction and alleviates modality discrepancies. In addition, a multimodal scale-aware fusion module (MSAFM) is incorporated to enhance scale-aware fusion and suppress irrelevant features through cascaded residual interactions. Furthermore, we propose a plug-and-play multimodal mixture-of-class-expert auto-prompting module (MMoEAPM), which enables the generation of pseudo-mask prompts for the original prompt encoder without additional training overhead, thereby supporting efficient auto-prompting. Extensive experiments and ablation studies on four benchmark multimodal RS datasets demonstrate that AutoSAM consistently achieves state-of-the-art performance across diverse modality combinations.
| Field | Value |
|---|---|
| Original language | English |
| Article number | 5612421 |
| Journal | IEEE Transactions on Geoscience and Remote Sensing |
| Volume | 64 |
| DOIs | |
| State | Published - 2026 |
Keywords
- Auto-prompting strategy
- multimodal remote sensing (RS)
- semantic segmentation
- vision foundation model
Title
AutoSAM: Auto-Prompting Mamba-Based Vision Foundation Model for Multimodal Remote Sensing Semantic Segmentation