TY - JOUR
T1 - AutoSAM
T2 - Auto-Prompting Mamba-Based Vision Foundation Model for Multimodal Remote Sensing Semantic Segmentation
AU - Li, Jiayuan
AU - Wang, Zhen
AU - Sun, Xiao
AU - Xu, Nan
AU - You, Zhuhong
AU - Huang, Deshuang
N1 - Publisher Copyright: © 1980-2012 IEEE.
PY - 2026
Y1 - 2026
N2 - Vision foundation models, such as the segment anything model (SAM), have advanced remote sensing (RS) tasks. However, extending SAM to multimodal RS semantic segmentation faces two key challenges: 1) SAM is tailored for unimodal inputs and lacks RS-specific knowledge, hindering effective spatial modeling and cross-modal feature integration; and 2) SAM depends on externally provided prompts (e.g., points or boxes), limiting its scalability and practicality in multimodal scenarios. To address these issues, we present AutoSAM, an end-to-end auto-prompting Mamba-based vision foundation model framework tailored for multimodal RS semantic segmentation. Specifically, we introduce a CrossMamba block (CMB) in the feature extraction stage to replace the conventional multihead self-attention mechanism, where the core reverse interactive scanning adaptor-SS2D module (RISASM) promotes semantic interaction and alleviates modality discrepancies. In addition, a multimodal scale-aware fusion module (MSAFM) is incorporated to enhance scale-aware fusion and suppress irrelevant features through cascaded residual interactions. Furthermore, we propose a plug-and-play multimodal mixture-of-class-expert auto-prompting module (MMoEAPM), which enables the generation of pseudo-mask prompts for the original prompt encoder without additional training overhead, thereby supporting efficient auto-prompting. Extensive experiments and ablation studies on four benchmark multimodal RS datasets demonstrate that AutoSAM consistently achieves state-of-the-art performance across diverse modality combinations.
KW - Auto-prompting strategy
KW - multimodal remote sensing (RS)
KW - semantic segmentation
KW - vision foundation model
UR - https://www.scopus.com/pages/publications/105033545970
U2 - 10.1109/TGRS.2026.3667690
DO - 10.1109/TGRS.2026.3667690
M3 - Article
AN - SCOPUS:105033545970
SN - 0196-2892
VL - 64
JO - IEEE Transactions on Geoscience and Remote Sensing
JF - IEEE Transactions on Geoscience and Remote Sensing
M1 - 5612421
ER -