TY - JOUR
T1 - A Proposal-Free One-Stage Framework for Referring Expression Comprehension and Generation via Dense Cross-Attention
AU - Sun, Mengyang
AU - Suo, Wei
AU - Wang, Peng
AU - Zhang, Yanning
AU - Wu, Qi
N1 - Publisher Copyright:
© 2021 IEEE.
PY - 2023
Y1 - 2023
N2 - Referring Expression Comprehension (REC) and Generation (REG) are among the most important tasks in visual reasoning, since they are essential steps for many vision-and-language tasks such as visual question answering and visual dialogue. However, they have not been widely adopted in downstream tasks, mainly for the following reasons: 1) mainstream two-stage methods rely on additional annotations or off-the-shelf detectors to generate proposals, which heavily degrades the generalization ability of models and leads to inevitable error accumulation; 2) although one-stage strategies for REC have been proposed, these methods depend on many hyper-parameters (such as anchors) to generate bounding boxes. In this paper, we present a proposal-free one-stage (PFOS) framework that can directly regress the region of interest from the image or generate unambiguous descriptions in an end-to-end manner. Instead of following the dominant two-stage fashion, we take the dense grid of an image as input to a cross-attention transformer that learns multi-modal correspondences. The final bounding box or sentence is predicted directly from the image without anchor selection or the computation of visual differences. Furthermore, we extend the traditional two-stage listener-speaker framework to joint training under a one-stage learning paradigm. Our model achieves state-of-the-art accuracy and speed for comprehension and competitive results for generation.
KW - One-stage method
KW - Referring expression comprehension
KW - Referring expression generation
UR - http://www.scopus.com/inward/record.url?scp=85124223753&partnerID=8YFLogxK
U2 - 10.1109/TMM.2022.3147385
DO - 10.1109/TMM.2022.3147385
M3 - Article
AN - SCOPUS:85124223753
SN - 1520-9210
VL - 25
SP - 2446
EP - 2458
JO - IEEE Transactions on Multimedia
JF - IEEE Transactions on Multimedia
ER -