TY - JOUR
T1 - Entity-Guided Attention Twisting Network for Referring Remote Sensing Image Segmentation
AU - Jia, Yuyu
AU - Zhou, Qing
AU - Gao, Junyu
AU - Wang, Qi
N1 - Publisher Copyright:
© 1980-2012 IEEE.
PY - 2025
Y1 - 2025
AB - Referring remote sensing image segmentation (RRSIS) aims to establish pixel-level interpretation of specific regions queried by textual expressions, bridging textual semantics and intelligent analysis of remote sensing imagery. In contrast to natural scenes, the intricate backgrounds of remote sensing scenarios result in low target–background contrast, often leading to semantic dispersion in segmented regions. Furthermore, conventional cross-attention-based referring image segmentation (RIS) methods struggle to bridge the modal gap, hindering fine-grained alignment between linguistic descriptions and geographical features. To overcome these challenges, we present a pioneering entity-guided attention twisting network (Enti-TwistNet) for RRSIS. Our framework first introduces a segment anything model (SAM)-inspired entity guidance (SEG) module that extracts spatially constrained entity prompts through a self-reasoning mask generation mechanism, constructing a comprehensive entity–visual–text tri-modal information cube. Subsequently, during cross-modal interaction, we propose a dual-phase attention-twisting (DAT) mechanism: 1) initially, sequential channel-wise scanning facilitates cross-modal semantic propagation (SP) and 2) subsequently, attention is twisted to the spatial dimension, integrating entity guidance to enhance the representation of irregular geographic boundaries. Extensive experiments on two widely used benchmarks, RefSegRS and RRSIS-D, demonstrate that Enti-TwistNet achieves significant performance improvements over existing state-of-the-art models.
KW - Attention twisting
KW - entity-aware guidance
KW - referring segmentation
KW - remote sensing
UR - https://www.scopus.com/pages/publications/105018335119
U2 - 10.1109/TGRS.2025.3615765
DO - 10.1109/TGRS.2025.3615765
M3 - Article
AN - SCOPUS:105018335119
SN - 0196-2892
VL - 63
JO - IEEE Transactions on Geoscience and Remote Sensing
JF - IEEE Transactions on Geoscience and Remote Sensing
M1 - 5645610
ER -