TY - JOUR
T1 - Spatial-Channel Attention Transformer With Pseudo Regions for Remote Sensing Image-Text Retrieval
AU - Wu, Dongqing
AU - Li, Huihui
AU - Hou, Yinxuan
AU - Xu, Cuili
AU - Cheng, Gong
AU - Guo, Lei
AU - Liu, Hang
N1 - Publisher Copyright: © 1980-2012 IEEE.
PY - 2024
Y1 - 2024
N2 - Remote sensing image-text retrieval (RSITR) has recently received significant attention due to its flexible query form and effective management of remote sensing images. However, prior work often relies on compact global features and ignores local features that can reflect salient objects in the images. Moreover, these methods primarily model interactions between features in the spatial domain, which is insufficient for mining the rich semantic information present in remote sensing images. In this article, we propose a novel spatial-channel attention transformer (SCAT) with pseudo regions to address these issues. Concretely, to acquire fine-grained perception of local objects, we introduce a pseudo region generation (PRG) module that adaptively aggregates grid features with similar semantic information into multiple clusters through a clustering algorithm. The resulting cluster centers flexibly and efficiently represent local objects in remote sensing images without relying on sophisticated object detectors. Furthermore, to achieve a comprehensive understanding of image semantic information, we construct a novel SCAT. By exploiting spatial and channel attention to explore the dependencies between features in both the spatial and channel domains, the proposed SCAT enhances the model's ability to identify both 'where to look' and 'what it is,' thereby obtaining a more powerful representation. In addition, SCAT incorporates two novel designs that alleviate the high overhead caused by attention modeling. Extensive experiments on two benchmark datasets, RSICD and RSITMD, demonstrate the effectiveness and superiority of the proposed method.
KW - Attention mechanism
KW - feature clustering
KW - remote sensing image-text retrieval (RSITR)
KW - transformer
UR - http://www.scopus.com/inward/record.url?scp=85192159258&partnerID=8YFLogxK
U2 - 10.1109/TGRS.2024.3395313
DO - 10.1109/TGRS.2024.3395313
M3 - Article
AN - SCOPUS:85192159258
SN - 0196-2892
VL - 62
SP - 1
EP - 15
JO - IEEE Transactions on Geoscience and Remote Sensing
JF - IEEE Transactions on Geoscience and Remote Sensing
M1 - 4704115
ER -