A Proposal-Free One-Stage Framework for Referring Expression Comprehension and Generation via Dense Cross-Attention

Mengyang Sun, Wei Suo, Peng Wang, Yanning Zhang, Qi Wu

Research output: Contribution to journal › Article › peer-review

31 Scopus citations

Abstract

Referring Expression Comprehension (REC) and Generation (REG) have become two of the most important tasks in visual reasoning, since they are an essential step for many vision-and-language tasks such as visual question answering and visual dialogue. However, they have not been widely used in downstream tasks, mainly for two reasons: 1) Mainstream two-stage methods rely on additional annotations or off-the-shelf detectors to generate proposals, which heavily degrades the generalization ability of the models and leads to inevitable error accumulation. 2) Although one-stage strategies for REC have been proposed, these methods depend on many hyper-parameters (such as anchors) to generate bounding boxes. In this paper, we present a proposal-free one-stage (PFOS) framework that can directly regress the region of interest from the image or generate unambiguous descriptions in an end-to-end manner. Instead of following the dominant two-stage fashion, we take the dense grid of the image as input to a cross-attention transformer that learns multi-modal correspondences. The final bounding box or sentence is predicted directly from the image, without anchor selection or the computation of visual difference. Furthermore, we extend the traditional two-stage listener-speaker framework so that it can be trained jointly under a one-stage learning paradigm. Our model achieves state-of-the-art performance in both accuracy and speed for comprehension and competitive results for generation.
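The core idea in the abstract, dense grid features attending to the words of the expression and a box being regressed directly without anchors or proposals, can be illustrated with a minimal sketch. This is not the PFOS architecture itself: the function names, the 4x4 grid, the feature size, the mean pooling, the sigmoid head and the random weights are all assumptions chosen only to keep the example small and self-contained.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention: every query attends over all keys."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def regress_box(fused, head_w, head_b):
    """Mean-pool the fused grid features, then apply a linear layer and a
    sigmoid to obtain a normalized (cx, cy, w, h) box. No anchors are used."""
    n = len(fused)
    pooled = [sum(f[j] for f in fused) / n for j in range(len(fused[0]))]
    return [sigmoid(sum(w * p for w, p in zip(row, pooled)) + b)
            for row, b in zip(head_w, head_b)]


random.seed(0)
dim = 8  # hypothetical feature size
# Hypothetical inputs: 16 cells of a 4x4 dense grid and 5 expression tokens.
grid = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(16)]
words = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(5)]

# Grid cells act as queries; word features act as keys and values.
fused = cross_attention(grid, words, words)

# Untrained, randomly initialized regression head (assumption).
head_w = [[random.gauss(0, 0.1) for _ in range(dim)] for _ in range(4)]
head_b = [0.0] * 4
box = regress_box(fused, head_w, head_b)
```

With trained weights, `box` would be compared against the ground-truth region; here it only shows that a single normalized box comes straight out of the fused grid features, with no proposal or anchor step in between.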

Original language: English
Pages (from-to): 2446-2458
Number of pages: 13
Journal: IEEE Transactions on Multimedia
Volume: 25
State: Published - 2023

Keywords

  • One-stage method
  • Referring expression comprehension
  • Referring expression generation
