AutoRoadSAM: Multimodal Remote Sensing Road Extraction with Structure-Semantic Awareness via Auto-Prompting Vision Foundation Models

Abstract
The integration of multimodal data holds great promise for advancing road extraction in remote sensing. However, existing approaches are limited by the lack of unified end-to-end frameworks for diverse modality combinations, suboptimal multimodal feature fusion, and challenges in capturing the slender, winding, and complex topological structures of roads. In this article, we propose AutoRoadSAM, a novel end-to-end framework for multimodal road extraction that fully exploits the powerful visual representation capabilities of the segment anything model (SAM) and, for the first time, introduces an auto-prompting mechanism via a dynamic snake convolution-based decoder. This decoder adaptively generates task-specific prompts by capturing fine-grained local geometric features from auxiliary modality branches, enabling precise alignment with complex road structures. To further enhance multimodal feature fusion and topological perception, we design the cross-modal information interaction (CMII) module, which facilitates global context modeling and cross-modal interaction, while strengthening the representation of intricate road topology through multidirectional snake scanning. Moreover, we incorporate a mask decoder with cross-polarity-aware linear attention (CPLAM) to boost decoding efficiency and effectively address pixel imbalance. Together, these innovations enable AutoRoadSAM to achieve superior structure- and semantic-aware road extraction across diverse modality combinations. Extensive experiments on six public datasets and four modality combinations demonstrate that AutoRoadSAM consistently outperforms state-of-the-art methods, validating the effectiveness and generalization capability of each proposed component. The code is available at https://github.com/NWPUFranklee/AutoRoadSAM.git.
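Since the abstract is the only technical description available here, below is a minimal PyTorch sketch of the two ideas it names that have standard formulations: a decoder that turns auxiliary-modality features into dense prompt embeddings (the auto-prompting idea), and a linear-attention block standing in for the paper's cross-polarity-aware linear attention, whose exact form the abstract does not give. Every name, shape, and design choice below — including the plain convolutions used in place of dynamic snake convolution — is an illustrative assumption, not the authors' implementation.

```python
# Hypothetical sketch of auto-prompting and linear attention; all module
# names and shapes are assumptions, not the AutoRoadSAM code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LinearAttention(nn.Module):
    """O(N) attention via the positive feature map phi(x) = elu(x) + 1
    (Katharopoulos et al., 2020). This is a generic stand-in for the
    paper's cross-polarity-aware linear attention (CPLAM)."""

    def __init__(self, dim, heads=8):
        super().__init__()
        self.heads = heads
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                  # x: (B, N, C)
        B, N, C = x.shape
        qkv = self.to_qkv(x).reshape(B, N, 3, self.heads, C // self.heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)               # each (B, H, N, d)
        q, k = F.elu(q) + 1, F.elu(k) + 1                  # positive feature maps
        # Aggregate keys/values once, then attend in linear time.
        kv = torch.einsum('bhnd,bhne->bhde', k, v)         # (B, H, d, d)
        z = 1 / (torch.einsum('bhnd,bhd->bhn', q, k.sum(2)) + 1e-6)
        out = torch.einsum('bhnd,bhde,bhn->bhne', q, kv, z)
        out = out.transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


class AutoPromptDecoder(nn.Module):
    """Generates dense prompt embeddings from an auxiliary-modality
    feature map, so no manual points/boxes are needed. A plain 3x3
    conv stack stands in for the dynamic snake convolution described
    in the abstract."""

    def __init__(self, in_ch, prompt_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, prompt_dim, 3, padding=1),
            nn.GELU(),
            nn.Conv2d(prompt_dim, prompt_dim, 3, padding=1),
        )

    def forward(self, aux_feat):                           # (B, C, H, W)
        return self.net(aux_feat)                          # prompts (B, D, H, W)


# Usage sketch: auxiliary-modality features -> dense prompts -> attention.
dec = AutoPromptDecoder(in_ch=256, prompt_dim=256)
prompts = dec(torch.randn(1, 256, 64, 64))                # (1, 256, 64, 64)
tokens = prompts.flatten(2).transpose(1, 2)               # (1, 4096, 256)
mixed = LinearAttention(dim=256)(tokens)                   # (1, 4096, 256)
```

In the actual framework, such prompt embeddings would condition a SAM-style mask decoder; the linear-attention block reduces token mixing from O(N²) to O(N) in the sequence length, which is consistent with the abstract's stated goal of boosting decoding efficiency.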
| Original language | English |
|---|---|
| Article number | 5607617 |
| Journal | IEEE Transactions on Geoscience and Remote Sensing |
| Volume | 64 |
| State | Published - 2026 |
Keywords
- Auto-prompting
- feature fusion
- multimodal remote sensing
- road extraction
- vision foundation models