Dual Stream Relation Learning Network for Image-Text Retrieval

Dongqing Wu; Huihui Li; Cang Gu; Lei Guo; Hang Liu

doi:10.1109/TMM.2024.3521736

Dual Stream Relation Learning Network for Image-Text Retrieval

Dongqing Wu, Huihui Li, Cang Gu, Lei Guo, Hang Liu

自动化学院

Northwestern Polytechnical University Xian

科研成果: 期刊稿件 › 文章 › 同行评审

摘要

Image-text retrieval has made remarkable achievements through the development of feature extraction networks and model architectures. However, almost all region feature-based methods face two serious problems when modeling modality interactions. First, region features are prone to feature entanglement in the feature extraction stage, making it difficult to accurately reason complex intra-model relations between visual objects. Second, region features lack rich contextual information, background, and object details, making it difficult to achieve precise inter-modal alignment with textual information. In this paper, we propose a novel Dual Stream Relation Learning Network (DSRLN) to jointly solve these issues with two key components: a Geometry-sensitive Interactive Self-Attention (GISA) module and a Dual Information Fusion (DIF) module. Specifically, GISA extends the vanilla self-attention network from two aspects to better model the intrinsic relationships between different regions, thereby improving high-level visual-semantic reasoning ability. DIF uses grid features as an additional visual information source, and achieves deeper and complex fusion between the two types of features through a masked cross-attention module and an adaptive gate fusion module, which can capture comprehensive visual information to learn more precise inter-modal alignment. Besides, our method also learns a more comprehensive hierarchical correspondence between images and sentences through local and global alignment. Experimental results on two public datasets, i.e., Flickr30K and MS-COCO, fully demonstrate the superiority and effectiveness of our model.

源语言	英语
页（从-至）	1551-1565
页数	15
期刊	IEEE Transactions on Multimedia
卷	27
DOI	https://doi.org/10.1109/TMM.2024.3521736
出版状态	已出版 - 2025

访问文件

10.1109/TMM.2024.3521736

其它文件与链接

链接到 Scopus 的出版物

引用此

@article{c8970a3e9d74470bb7ca058219f29de3,

title = "Dual Stream Relation Learning Network for Image-Text Retrieval",

abstract = "Image-text retrieval has made remarkable achievements through the development of feature extraction networks and model architectures. However, almost all region feature-based methods face two serious problems when modeling modality interactions. First, region features are prone to feature entanglement in the feature extraction stage, making it difficult to accurately reason complex intra-model relations between visual objects. Second, region features lack rich contextual information, background, and object details, making it difficult to achieve precise inter-modal alignment with textual information. In this paper, we propose a novel Dual Stream Relation Learning Network (DSRLN) to jointly solve these issues with two key components: a Geometry-sensitive Interactive Self-Attention (GISA) module and a Dual Information Fusion (DIF) module. Specifically, GISA extends the vanilla self-attention network from two aspects to better model the intrinsic relationships between different regions, thereby improving high-level visual-semantic reasoning ability. DIF uses grid features as an additional visual information source, and achieves deeper and complex fusion between the two types of features through a masked cross-attention module and an adaptive gate fusion module, which can capture comprehensive visual information to learn more precise inter-modal alignment. Besides, our method also learns a more comprehensive hierarchical correspondence between images and sentences through local and global alignment. Experimental results on two public datasets, i.e., Flickr30K and MS-COCO, fully demonstrate the superiority and effectiveness of our model.",

keywords = "grid feature, Image-text retrieval, region feature, self-attention",

author = "Dongqing Wu and Huihui Li and Cang Gu and Lei Guo and Hang Liu",

note = "Publisher Copyright: {\textcopyright} 1999-2012 IEEE.",

year = "2025",

doi = "10.1109/TMM.2024.3521736",

language = "英语",

volume = "27",

pages = "1551--1565",

journal = "IEEE Transactions on Multimedia",

issn = "1520-9210",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - JOUR

T1 - Dual Stream Relation Learning Network for Image-Text Retrieval

AU - Wu, Dongqing

AU - Li, Huihui

AU - Gu, Cang

AU - Guo, Lei

AU - Liu, Hang

PY - 2025

Y1 - 2025

N2 - Image-text retrieval has made remarkable achievements through the development of feature extraction networks and model architectures. However, almost all region feature-based methods face two serious problems when modeling modality interactions. First, region features are prone to feature entanglement in the feature extraction stage, making it difficult to accurately reason complex intra-model relations between visual objects. Second, region features lack rich contextual information, background, and object details, making it difficult to achieve precise inter-modal alignment with textual information. In this paper, we propose a novel Dual Stream Relation Learning Network (DSRLN) to jointly solve these issues with two key components: a Geometry-sensitive Interactive Self-Attention (GISA) module and a Dual Information Fusion (DIF) module. Specifically, GISA extends the vanilla self-attention network from two aspects to better model the intrinsic relationships between different regions, thereby improving high-level visual-semantic reasoning ability. DIF uses grid features as an additional visual information source, and achieves deeper and complex fusion between the two types of features through a masked cross-attention module and an adaptive gate fusion module, which can capture comprehensive visual information to learn more precise inter-modal alignment. Besides, our method also learns a more comprehensive hierarchical correspondence between images and sentences through local and global alignment. Experimental results on two public datasets, i.e., Flickr30K and MS-COCO, fully demonstrate the superiority and effectiveness of our model.

AB - Image-text retrieval has made remarkable achievements through the development of feature extraction networks and model architectures. However, almost all region feature-based methods face two serious problems when modeling modality interactions. First, region features are prone to feature entanglement in the feature extraction stage, making it difficult to accurately reason complex intra-model relations between visual objects. Second, region features lack rich contextual information, background, and object details, making it difficult to achieve precise inter-modal alignment with textual information. In this paper, we propose a novel Dual Stream Relation Learning Network (DSRLN) to jointly solve these issues with two key components: a Geometry-sensitive Interactive Self-Attention (GISA) module and a Dual Information Fusion (DIF) module. Specifically, GISA extends the vanilla self-attention network from two aspects to better model the intrinsic relationships between different regions, thereby improving high-level visual-semantic reasoning ability. DIF uses grid features as an additional visual information source, and achieves deeper and complex fusion between the two types of features through a masked cross-attention module and an adaptive gate fusion module, which can capture comprehensive visual information to learn more precise inter-modal alignment. Besides, our method also learns a more comprehensive hierarchical correspondence between images and sentences through local and global alignment. Experimental results on two public datasets, i.e., Flickr30K and MS-COCO, fully demonstrate the superiority and effectiveness of our model.

KW - grid feature

KW - Image-text retrieval

KW - region feature

KW - self-attention

UR - http://www.scopus.com/inward/record.url?scp=105001086925&partnerID=8YFLogxK

U2 - 10.1109/TMM.2024.3521736

DO - 10.1109/TMM.2024.3521736

M3 - 文章

AN - SCOPUS:105001086925

SN - 1520-9210

VL - 27

SP - 1551

EP - 1565

JO - IEEE Transactions on Multimedia

JF - IEEE Transactions on Multimedia

ER -

Dual Stream Relation Learning Network for Image-Text Retrieval

摘要

访问文件

其它文件与链接

指纹

引用此