TY - JOUR
T1 - Exploring Uni-Modal Feature Learning on Entities and Relations for Remote Sensing Cross-Modal Text-Image Retrieval
AU - Zhang, Shun
AU - Li, Yupeng
AU - Mei, Shaohui
N1 - Publisher Copyright:
© 1980-2012 IEEE.
PY - 2023
Y1 - 2023
N2 - Remote sensing cross-modal text-image retrieval (RSCTIR) has recently received unprecedented attention due to its advantages of flexible input and efficient querying over enormous remote sensing (RS) image archives. However, most RSCTIR methods focus exclusively on cross-modal semantic alignment between the text and image modalities and easily fall into the trap of information redundancy, leading to degraded retrieval accuracy. To address these issues, we construct a novel RSCTIR framework based on mask-guided relation modeling with entity loss (MGRM-EL), which fully exploits uni-modal feature learning on entities and relations during cross-modal model training. Specifically, we leverage the Transformer encoder architecture for its ability to capture long-distance dependencies from a global view and build two uni-modal (visual and textual) Transformer encoders, combined with a convolutional neural network (CNN), to extract the spatial interregion relations of images and the long-term interword relations of texts for prominent feature embedding of visual and semantic representations. A mask-guided attention strategy is further introduced to learn the salient regions and words, with the aim of enhancing the RSCTIR model's uni-modal learning ability and eliminating unnecessary and redundant information in each modality. Unlike existing methods that simply compute the semantic similarity between images and texts in their loss functions, we present a novel uni-modal entity loss, which treats each image as an image entity and merges similar texts into a text entity, to learn the independent distribution of entities in each modality. Extensive experiments on the public RSCTIR benchmarks RSICD and RSITMD demonstrate the state-of-the-art performance of the proposed method on the RSCTIR task.
KW - Mask-guided attention strategy
KW - relation exploration
KW - remote sensing cross-modal text-image retrieval (RSCTIR)
KW - uni-modal entity loss
UR - http://www.scopus.com/inward/record.url?scp=85178066904&partnerID=8YFLogxK
U2 - 10.1109/TGRS.2023.3333375
DO - 10.1109/TGRS.2023.3333375
M3 - Article
AN - SCOPUS:85178066904
SN - 0196-2892
VL - 61
JO - IEEE Transactions on Geoscience and Remote Sensing
JF - IEEE Transactions on Geoscience and Remote Sensing
M1 - 5626317
ER -