TY - GEN
T1 - Improving Fusion of Region Features and Grid Features via Two-Step Interaction for Image-Text Retrieval
AU - Wu, Dongqing
AU - Li, Huihui
AU - Gu, Cang
AU - Guo, Lei
AU - Liu, Hang
N1 - Publisher Copyright:
© 2022 ACM.
PY - 2022/10/10
Y1 - 2022/10/10
N2 - In recent years, region features extracted from object detection networks have been widely used in the image-text retrieval task. However, they lack rich background and contextual information, which makes it difficult to match words describing global concepts in sentences. Meanwhile, region features also lose fine-grained details of objects in the image. Fortunately, these disadvantages of region features are precisely the strengths of grid features. In this paper, we propose a novel framework that fuses region features and grid features through a two-step interaction strategy, thus extracting a more comprehensive image representation for image-text retrieval. Concretely, in the first step, a joint graph with spatial information constraints is constructed, where all region features and grid features are represented as graph nodes. By modeling relationships on this joint graph, information can be passed along its edges. In the second step, we propose a Cross-attention Gated Fusion module, which further explores the complex interactions between region features and grid features and then adaptively fuses the two types of features. With these two steps, our model can fully exploit the complementary advantages of region features and grid features. In addition, we propose a Multi-Attention Pooling module to better aggregate the fused region features and grid features. Extensive experiments on two public datasets, Flickr30K and MS-COCO, demonstrate that our model achieves state-of-the-art results and pushes image-text retrieval performance to a new level.
AB - In recent years, region features extracted from object detection networks have been widely used in the image-text retrieval task. However, they lack rich background and contextual information, which makes it difficult to match words describing global concepts in sentences. Meanwhile, region features also lose fine-grained details of objects in the image. Fortunately, these disadvantages of region features are precisely the strengths of grid features. In this paper, we propose a novel framework that fuses region features and grid features through a two-step interaction strategy, thus extracting a more comprehensive image representation for image-text retrieval. Concretely, in the first step, a joint graph with spatial information constraints is constructed, where all region features and grid features are represented as graph nodes. By modeling relationships on this joint graph, information can be passed along its edges. In the second step, we propose a Cross-attention Gated Fusion module, which further explores the complex interactions between region features and grid features and then adaptively fuses the two types of features. With these two steps, our model can fully exploit the complementary advantages of region features and grid features. In addition, we propose a Multi-Attention Pooling module to better aggregate the fused region features and grid features. Extensive experiments on two public datasets, Flickr30K and MS-COCO, demonstrate that our model achieves state-of-the-art results and pushes image-text retrieval performance to a new level.
KW - cross-modal retrieval
KW - feature interaction and fusion
KW - graph attention networks
KW - image-text matching
UR - http://www.scopus.com/inward/record.url?scp=85150990925&partnerID=8YFLogxK
U2 - 10.1145/3503161.3548223
DO - 10.1145/3503161.3548223
M3 - Conference contribution
AN - SCOPUS:85150990925
T3 - MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia
SP - 5055
EP - 5064
BT - MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia
PB - Association for Computing Machinery, Inc
T2 - 30th ACM International Conference on Multimedia, MM 2022
Y2 - 10 October 2022 through 14 October 2022
ER -