Global-Guided Asymmetric Attention Network for Image-Text Matching

  • Dongqing Wu
  • Huihui Li
  • Yinge Tang
  • Lei Guo
  • Hang Liu
  • Northwestern Polytechnical University Xian

Research output: Contribution to journal › Article › peer-review

9 Scopus citations

Abstract

Image-text matching is a vital yet challenging task in the field of vision and language. Unlike previous methods, which usually adopt a symmetrical network to embed images and sentences independently into a joint latent space, we propose a novel Global-guided Asymmetric Attention Network (GAAN) to represent the two modalities more comprehensively. Specifically, we first design a Global Information-guided Transformer Encoder (GITE) to mitigate the lack of contextual information in the region features. By taking full advantage of the image's global information, GITE models region-region relations and region-global relations simultaneously, yielding a more accurate visual representation. Then, we adopt a Textual Self-Attention (TSA) module to explore word-word relations and produce context-aware word representations. Finally, we deploy an Image-guided Textual Attention (ITA) module to explore the fine-grained correspondence between image regions and sentence words. By using context-aware visual information to guide textual representation learning, we build asymmetric connections between vision and language that better exploit textual information. Experimental results on two benchmark datasets, MSCOCO and Flickr30k, show that GAAN significantly surpasses state-of-the-art methods.
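The three attention stages named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract gives no equations, so the sketch assumes plain single-head scaled dot-product attention with no learned projections, random placeholder features, and a hypothetical mean-cosine matching score. All function names, variable names, and dimensions below are assumptions chosen for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(queries, keys, values):
    # Generic scaled dot-product attention (assumed here; the abstract
    # does not specify the exact attention form used in GAAN).
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ values, weights


def cosine(a, b):
    # Row-wise cosine similarity between two equally shaped matrices.
    num = (a * b).sum(axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    return num / den


# Placeholder inputs: sizes are arbitrary, not taken from the paper.
rng = np.random.default_rng(0)
num_regions, num_words, dim = 36, 12, 64
regions = rng.standard_normal((num_regions, dim))   # region features
global_feat = rng.standard_normal((1, dim))         # image-level feature
words = rng.standard_normal((num_words, dim))       # word features

# Stage 1 (GITE-like, assumed): self-attention over the global feature
# plus the regions, so each region sees other regions and the global one.
visual_tokens = np.concatenate([global_feat, regions], axis=0)
visual_ctx, _ = attention(visual_tokens, visual_tokens, visual_tokens)
region_ctx = visual_ctx[1:]  # drop the global token, keep the regions

# Stage 2 (TSA-like, assumed): self-attention over the words.
word_ctx, _ = attention(words, words, words)

# Stage 3 (ITA-like, assumed): regions are the queries and words are the
# keys/values, which is what makes the connection asymmetric.
attended_text, ita_weights = attention(region_ctx, word_ctx, word_ctx)

# Hypothetical matching score: mean cosine similarity between each
# contextual region and the text it attends to.
score = float(cosine(region_ctx, attended_text).mean())
```

In this sketch the asymmetry comes only from the third stage, where visual features query textual features but not the reverse; a symmetric design would either skip that stage or run it in both directions.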

Original language: English
Pages (from-to): 77-90
Number of pages: 14
Journal: Neurocomputing
Volume: 481
State: Published - 7 Apr 2022

Keywords

  • Asymmetric relation modeling
  • Cross-attention
  • Image-text matching
  • Self-attention
