TY - JOUR
T1 - Dual Stream Relation Learning Network for Image-Text Retrieval
AU - Wu, Dongqing
AU - Li, Huihui
AU - Gu, Cang
AU - Guo, Lei
AU - Liu, Hang
N1 - Publisher Copyright: © 1999-2012 IEEE.
PY - 2025
Y1 - 2025
N2 - Image-text retrieval has made remarkable progress through the development of feature extraction networks and model architectures. However, almost all region feature-based methods face two serious problems when modeling modality interactions. First, region features are prone to feature entanglement in the feature extraction stage, making it difficult to reason accurately about complex intra-modal relations between visual objects. Second, region features lack rich contextual information, background, and object details, making it difficult to achieve precise inter-modal alignment with textual information. In this paper, we propose a novel Dual Stream Relation Learning Network (DSRLN) to jointly solve these issues with two key components: a Geometry-sensitive Interactive Self-Attention (GISA) module and a Dual Information Fusion (DIF) module. Specifically, GISA extends the vanilla self-attention network in two respects to better model the intrinsic relationships between different regions, thereby improving high-level visual-semantic reasoning ability. DIF uses grid features as an additional visual information source and achieves a deeper and more complex fusion between the two types of features through a masked cross-attention module and an adaptive gate fusion module, capturing comprehensive visual information to learn more precise inter-modal alignment. In addition, our method learns a more comprehensive hierarchical correspondence between images and sentences through local and global alignment. Experimental results on two public datasets, Flickr30K and MS-COCO, demonstrate the effectiveness of our model.
KW - grid feature
KW - Image-text retrieval
KW - region feature
KW - self-attention
UR - http://www.scopus.com/inward/record.url?scp=105001086925&partnerID=8YFLogxK
U2 - 10.1109/TMM.2024.3521736
DO - 10.1109/TMM.2024.3521736
M3 - Article
AN - SCOPUS:105001086925
SN - 1520-9210
VL - 27
SP - 1551
EP - 1565
JO - IEEE Transactions on Multimedia
JF - IEEE Transactions on Multimedia
ER -