AutoSAM: Auto-Prompting Mamba-Based Vision Foundation Model for Multimodal Remote Sensing Semantic Segmentation

  • Jiayuan Li
  • Zhen Wang
  • Xiao Sun
  • Nan Xu
  • Zhuhong You
  • Deshuang Huang
  • Northwestern Polytechnical University Xian
  • Xijing University
  • Hohai University
  • Guangxi Academy of Agricultural Sciences

Research output: Contribution to journal › Article › peer-review

2 Scopus citations

Abstract

Vision foundation models, such as the segment anything model (SAM), have advanced remote sensing (RS) tasks. However, extending SAM to multimodal RS semantic segmentation faces two key challenges: 1) SAM is tailored for unimodal inputs and lacks RS-specific knowledge, hindering effective spatial modeling and cross-modal feature integration; and 2) SAM depends on externally provided prompts (e.g., points or boxes), limiting its scalability and practicality in multimodal scenarios. To address these issues, we present AutoSAM, an end-to-end auto-prompting Mamba-based vision foundation model framework tailored for multimodal RS semantic segmentation. Specifically, we introduce a CrossMamba block (CMB) in the feature extraction stage to replace the conventional multihead self-attention mechanism, where the core reverse interactive scanning adaptor-SS2D module (RISASM) promotes semantic interaction and alleviates modality discrepancies. In addition, a multimodal scale-aware fusion module (MSAFM) is incorporated to enhance scale-aware fusion and suppress irrelevant features through cascaded residual interactions. Furthermore, we propose a plug-and-play multimodal mixture-of-class-expert auto-prompting module (MMoEAPM), which enables the generation of pseudo-mask prompts for the original prompt encoder without additional training overhead, thereby supporting efficient auto-prompting. Extensive experiments and ablation studies on four benchmark multimodal RS datasets demonstrate that AutoSAM consistently achieves state-of-the-art performance across diverse modality combinations.
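
The abstract does not spell out how MMoEAPM builds its pseudo-mask prompts, so the following is only an illustrative sketch of the general idea of auto-prompting from fused per-class predictions, not the authors' implementation. It assumes two modalities that each yield coarse class logits, and a per-class gate that weights the two modalities; the function name `pseudo_mask_prompts`, the `gate_scores` input, and the two-modality setup are all hypothetical.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def pseudo_mask_prompts(logits_a, logits_b, gate_scores):
    """Hypothetical sketch: fuse two modalities per class and derive mask prompts.

    logits_a, logits_b: (C, H, W) coarse class logits, one array per modality.
    gate_scores: (C, 2) unnormalized per-class preference for each modality
                 (an assumption; the abstract gives no gating details).
    Returns per-pixel class probabilities (C, H, W) and boolean masks (C, H, W).
    """
    gates = softmax(gate_scores, axis=1)  # each class row sums to 1
    # Broadcast the per-class gate over the spatial dimensions.
    fused = (gates[:, 0, None, None] * logits_a
             + gates[:, 1, None, None] * logits_b)
    probs = softmax(fused, axis=0)        # distribution over classes per pixel
    labels = probs.argmax(axis=0)         # (H, W) hard class assignment
    num_classes = fused.shape[0]
    one_hot = np.eye(num_classes, dtype=bool)[labels]  # (H, W, C)
    masks = np.moveaxis(one_hot, -1, 0)   # (C, H, W), one mask per class
    return probs, masks


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits_a = rng.normal(size=(3, 4, 4))   # stand-in for one modality
    logits_b = rng.normal(size=(3, 4, 4))   # stand-in for a second modality
    gate_scores = np.zeros((3, 2))          # equal trust in both modalities
    probs, masks = pseudo_mask_prompts(logits_a, logits_b, gate_scores)
    print(probs.shape, masks.shape)
```

In a SAM-style pipeline, masks of this kind would stand in for the externally supplied points or boxes; how AutoSAM actually routes them into the prompt encoder is not described in the abstract.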

Original language: English
Article number: 5612421
Journal: IEEE Transactions on Geoscience and Remote Sensing
Volume: 64
State: Published - 2026

Keywords

  • Auto-prompting strategy
  • multimodal remote sensing (RS)
  • semantic segmentation
  • vision foundation model
