Improving Fusion of Region Features and Grid Features via Two-Step Interaction for Image-Text Retrieval

Dongqing Wu; Huihui Li; Cang Gu; Lei Guo; Hang Liu

doi:10.1145/3503161.3548223

School of Automation

Northwestern Polytechnical University, Xi'an

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

6 Scopus citations

Abstract

In recent years, region features extracted from object detection networks have been widely used in the image-text retrieval task. However, they lack rich background and contextual information, which makes it difficult to match words describing global concepts in sentences. Meanwhile, the region features also lose the details of objects in the image. Fortunately, these disadvantages of region features are the advantages of grid features. In this paper, we propose a novel framework, which fuses the region features and grid features through a two-step interaction strategy, thus extracting a more comprehensive image representation for image-text retrieval. Concretely, in the first step, a joint graph with spatial information constraints is constructed, where all region features and grid features are represented as graph nodes. By modeling the relationships using the joint graph, the information can be passed edge-wise. In the second step, we propose a Cross-attention Gated Fusion module, which further explores the complex interactions between region features and grid features, and then adaptively fuses different types of features. With these two steps, our model can fully realize the complementary advantages of region features and grid features. In addition, we propose a Multi-Attention Pooling module to better aggregate the fused region features and grid features. Extensive experiments on two public datasets, including Flickr30K and MS-COCO, demonstrate that our model achieves the state-of-the-art and pushes the performance of image-text retrieval to a new height.
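The abstract gives no formulas for the Cross-attention Gated Fusion module, so the following is only an illustrative sketch of the general idea it names (cross-attention between the two feature sets, followed by a learned sigmoid gate that mixes each feature with its attended context). It is not the authors' implementation. The function names, feature sizes, random inputs, and the single shared gate matrix `w_gate` are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys_values):
    """Scaled dot-product attention: each query row attends over keys_values."""
    dim = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(dim)
    return softmax(scores, axis=-1) @ keys_values


def gated_fusion(own, context, w_gate):
    """Mix each feature with its attended context using a sigmoid gate.

    The gate is computed from the concatenation [own, context]; this
    particular gate form is an assumption, not taken from the paper.
    """
    gate_input = np.concatenate([own, context], axis=-1) @ w_gate
    gate = 1.0 / (1.0 + np.exp(-gate_input))
    return gate * own + (1.0 - gate) * context


# Hypothetical toy inputs: 5 region features and 9 grid features of size 8.
rng = np.random.default_rng(0)
dim = 8
regions = rng.normal(size=(5, dim))
grids = rng.normal(size=(9, dim))
w_gate = rng.normal(size=(2 * dim, dim)) * 0.1  # assumed shared gate weights

# Regions gather context from grids, and grids gather context from regions.
region_context = cross_attention(regions, grids)
grid_context = cross_attention(grids, regions)

fused_regions = gated_fusion(regions, region_context, w_gate)
fused_grids = gated_fusion(grids, grid_context, w_gate)

print(fused_regions.shape, fused_grids.shape)
```

Because the gate lies strictly between 0 and 1, each fused value is an element-wise interpolation between a feature and its cross-attended context, which is one simple way to read "adaptively fuses different types of features".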

Original language	English
Title of host publication	MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia
Publisher	Association for Computing Machinery, Inc
Pages	5055-5064
Number of pages	10
ISBN (Electronic)	9781450392037
DOIs	https://doi.org/10.1145/3503161.3548223
State	Published - 10 Oct 2022
Event	30th ACM International Conference on Multimedia, MM 2022 - Lisboa, Portugal; Duration: 10 Oct 2022 → 14 Oct 2022

Publication series

Name	MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia

Conference

Conference	30th ACM International Conference on Multimedia, MM 2022
Country/Territory	Portugal
City	Lisboa
Period	10/10/22 → 14/10/22

Keywords

cross-modal retrieval
feature interaction and fusion
graph attention networks
image-text matching

Cite this

Wu, D., Li, H., Gu, C., Guo, L., & Liu, H. (2022). Improving Fusion of Region Features and Grid Features via Two-Step Interaction for Image-Text Retrieval. In MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia (pp. 5055-5064). (MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia). Association for Computing Machinery, Inc. https://doi.org/10.1145/3503161.3548223

Wu, Dongqing ; Li, Huihui ; Gu, Cang et al. / Improving Fusion of Region Features and Grid Features via Two-Step Interaction for Image-Text Retrieval. MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia. Association for Computing Machinery, Inc, 2022. pp. 5055-5064 (MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia).

@inproceedings{2342879e760f4469b8e5e5467970d1f2,

title = "Improving Fusion of Region Features and Grid Features via Two-Step Interaction for Image-Text Retrieval",

abstract = "In recent years, region features extracted from object detection networks have been widely used in the image-text retrieval task. However, they lack rich background and contextual information, which makes it difficult to match words describing global concepts in sentences. Meanwhile, the region features also lose the details of objects in the image. Fortunately, these disadvantages of region features are the advantages of grid features. In this paper, we propose a novel framework, which fuses the region features and grid features through a two-step interaction strategy, thus extracting a more comprehensive image representation for image-text retrieval. Concretely, in the first step, a joint graph with spatial information constraints is constructed, where all region features and grid features are represented as graph nodes. By modeling the relationships using the joint graph, the information can be passed edge-wise. In the second step, we propose a Cross-attention Gated Fusion module, which further explores the complex interactions between region features and grid features, and then adaptively fuses different types of features. With these two steps, our model can fully realize the complementary advantages of region features and grid features. In addition, we propose a Multi-Attention Pooling module to better aggregate the fused region features and grid features. Extensive experiments on two public datasets, including Flickr30K and MS-COCO, demonstrate that our model achieves the state-of-the-art and pushes the performance of image-text retrieval to a new height.",

keywords = "cross-modal retrieval, feature interaction and fusion, graph attention networks, image-text matching",

author = "Dongqing Wu and Huihui Li and Cang Gu and Lei Guo and Hang Liu",

note = "Publisher Copyright: {\textcopyright} 2022 ACM.; 30th ACM International Conference on Multimedia, MM 2022 ; Conference date: 10-10-2022 Through 14-10-2022",

year = "2022",

month = oct,

day = "10",

doi = "10.1145/3503161.3548223",

language = "English",

series = "MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia",

publisher = "Association for Computing Machinery, Inc",

pages = "5055--5064",

booktitle = "MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia",

}

Wu, D, Li, H, Gu, C, Guo, L & Liu, H 2022, Improving Fusion of Region Features and Grid Features via Two-Step Interaction for Image-Text Retrieval. in MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia. MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia, Association for Computing Machinery, Inc, pp. 5055-5064, 30th ACM International Conference on Multimedia, MM 2022, Lisboa, Portugal, 10/10/22. https://doi.org/10.1145/3503161.3548223

Improving Fusion of Region Features and Grid Features via Two-Step Interaction for Image-Text Retrieval. / Wu, Dongqing; Li, Huihui; Gu, Cang et al.
MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia. Association for Computing Machinery, Inc, 2022. p. 5055-5064 (MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia).

TY - GEN

T1 - Improving Fusion of Region Features and Grid Features via Two-Step Interaction for Image-Text Retrieval

AU - Wu, Dongqing

AU - Li, Huihui

AU - Gu, Cang

AU - Guo, Lei

AU - Liu, Hang

PY - 2022/10/10

Y1 - 2022/10/10

AB - In recent years, region features extracted from object detection networks have been widely used in the image-text retrieval task. However, they lack rich background and contextual information, which makes it difficult to match words describing global concepts in sentences. Meanwhile, the region features also lose the details of objects in the image. Fortunately, these disadvantages of region features are the advantages of grid features. In this paper, we propose a novel framework, which fuses the region features and grid features through a two-step interaction strategy, thus extracting a more comprehensive image representation for image-text retrieval. Concretely, in the first step, a joint graph with spatial information constraints is constructed, where all region features and grid features are represented as graph nodes. By modeling the relationships using the joint graph, the information can be passed edge-wise. In the second step, we propose a Cross-attention Gated Fusion module, which further explores the complex interactions between region features and grid features, and then adaptively fuses different types of features. With these two steps, our model can fully realize the complementary advantages of region features and grid features. In addition, we propose a Multi-Attention Pooling module to better aggregate the fused region features and grid features. Extensive experiments on two public datasets, including Flickr30K and MS-COCO, demonstrate that our model achieves the state-of-the-art and pushes the performance of image-text retrieval to a new height.

KW - cross-modal retrieval

KW - feature interaction and fusion

KW - graph attention networks

KW - image-text matching

UR - http://www.scopus.com/inward/record.url?scp=85150990925&partnerID=8YFLogxK

U2 - 10.1145/3503161.3548223

DO - 10.1145/3503161.3548223

M3 - Conference contribution

AN - SCOPUS:85150990925

T3 - MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia

SP - 5055

EP - 5064

BT - MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia

PB - Association for Computing Machinery, Inc

T2 - 30th ACM International Conference on Multimedia, MM 2022

Y2 - 10 October 2022 through 14 October 2022

ER -

Wu D, Li H, Gu C, Guo L, Liu H. Improving Fusion of Region Features and Grid Features via Two-Step Interaction for Image-Text Retrieval. In MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia. Association for Computing Machinery, Inc. 2022. p. 5055-5064. (MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia). doi: 10.1145/3503161.3548223
