TY - JOUR
T1 - Hierarchical textual-visual guidance for referring remote sensing segmentation
AU - Zhang, Shiyu
AU - Zhou, Qing
AU - Wang, Qi
AU - Yuan, Yuan
AU - Gao, Junyu
N1 - Publisher Copyright: © 2026 Elsevier Ltd
PY - 2026/11
Y1 - 2026/11
N2 - Referring Remote Sensing Image Segmentation (RRSIS) aims to precisely segment regions in remote sensing images based on natural language expressions. However, a central challenge lies in language-visual ambiguity, as remote sensing expressions often involve property-dense functional categories and implicit spatial relations, while the corresponding images simultaneously present substantial scale variation and intricate spatial layouts. Existing methods struggle to effectively ground complex textual semantics within intricate remote sensing images. To address this challenge, we propose a method from the perspective of hierarchical textual-visual guidance. Specifically, we design a Textual Semantic Parsing Module (TSPM), which disambiguates complex referring expressions by transforming them into hierarchical attributes encompassing category recognition, spatial constraints, relational semantics, and intrinsic properties, thereby providing explicit cues for visual grounding. Building upon these structured cues, we further develop an Adaptive Visual-aware Modulation Module (AVMM), which integrates Dual-Path hierarchical Visual Feature Extraction and Dynamic Convolutional Perception Mechanism to adaptively modulate features under the hierarchical textual guidance from TSPM. Through the joint effect of TSPM and AVMM, our approach effectively bridges the gap caused by language-visual ambiguity. The proposed method is evaluated on two public RRSIS datasets, achieving state-of-the-art performance with mIoU scores of 68.81% on RefSegRS and 64.82% on RRSIS-D.
AB - Referring Remote Sensing Image Segmentation (RRSIS) aims to precisely segment regions in remote sensing images based on natural language expressions. However, a central challenge lies in language-visual ambiguity, as remote sensing expressions often involve property-dense functional categories and implicit spatial relations, while the corresponding images simultaneously present substantial scale variation and intricate spatial layouts. Existing methods struggle to effectively ground complex textual semantics within intricate remote sensing images. To address this challenge, we propose a method from the perspective of hierarchical textual-visual guidance. Specifically, we design a Textual Semantic Parsing Module (TSPM), which disambiguates complex referring expressions by transforming them into hierarchical attributes encompassing category recognition, spatial constraints, relational semantics, and intrinsic properties, thereby providing explicit cues for visual grounding. Building upon these structured cues, we further develop an Adaptive Visual-aware Modulation Module (AVMM), which integrates Dual-Path hierarchical Visual Feature Extraction and Dynamic Convolutional Perception Mechanism to adaptively modulate features under the hierarchical textual guidance from TSPM. Through the joint effect of TSPM and AVMM, our approach effectively bridges the gap caused by language-visual ambiguity. The proposed method is evaluated on two public RRSIS datasets, achieving state-of-the-art performance with mIoU scores of 68.81% on RefSegRS and 64.82% on RRSIS-D.
KW - Hierarchical textual-visual guidance
KW - Referring image segmentation
KW - Remote sensing
UR - https://www.scopus.com/pages/publications/105034378389
U2 - 10.1016/j.patcog.2026.113579
DO - 10.1016/j.patcog.2026.113579
M3 - Article
AN - SCOPUS:105034378389
SN - 0031-3203
VL - 179
JO - Pattern Recognition
JF - Pattern Recognition
M1 - 113579
ER -