TY - JOUR
T1 - Exploring Uni-Modal Feature Learning on Entities and Relations for Remote Sensing Cross-Modal Text-Image Retrieval
AU - Zhang, Shun
AU - Li, Yupeng
AU - Mei, Shaohui
N1 - Publisher Copyright:
© 1980-2012 IEEE.
PY - 2023
Y1 - 2023
N2 - Remote sensing cross-modal text-image retrieval (RSCTIR) has recently received unprecedented attention due to its advantages of flexible input and efficient querying over enormous remote sensing (RS) image archives. However, most RSCTIR methods focus exclusively on cross-modal semantic alignment between the text and image modalities and easily fall into the trap of information redundancy, leading to degraded retrieval accuracy. To address these issues, we construct a novel RSCTIR framework based on mask-guided relation modeling with entity loss (MGRM-EL), which fully exploits uni-modal feature learning on entities and relations during cross-modal model training. Specifically, we leverage the Transformer encoder architecture for its ability to capture long-distance dependencies from a global view and build two uni-modal (visual and textual) Transformer encoders, combined with a convolutional neural network (CNN), to extract the spatial interregion relations of images and the long-term interword relations of texts for prominent feature embedding of visual and semantic representations. A mask-guided attention strategy is further introduced to learn the salient regions and words, with the aim of enhancing the RSCTIR model's uni-modal learning ability and eliminating unnecessary and redundant information in each modality. Unlike existing methods that simply compute the semantic similarity between images and texts in their loss functions, we present a novel uni-modal entity loss, which treats each image as an image entity and merges similar texts into a text entity, to learn the independent distribution of entities in each modality. Extensive experiments on the public RSCTIR benchmarks RSICD and RSITMD demonstrate the state-of-the-art performance of the proposed method on the RSCTIR task.
KW - Mask-guided attention strategy
KW - relation exploration
KW - remote sensing cross-modal text-image retrieval (RSCTIR)
KW - uni-modal entity loss
UR - http://www.scopus.com/inward/record.url?scp=85178066904&partnerID=8YFLogxK
U2 - 10.1109/TGRS.2023.3333375
DO - 10.1109/TGRS.2023.3333375
M3 - Article
AN - SCOPUS:85178066904
SN - 0196-2892
VL - 61
JO - IEEE Transactions on Geoscience and Remote Sensing
JF - IEEE Transactions on Geoscience and Remote Sensing
M1 - 5626317
ER -