TY - JOUR
T1 - A Proposal-Free One-Stage Framework for Referring Expression Comprehension and Generation via Dense Cross-Attention
AU - Sun, Mengyang
AU - Suo, Wei
AU - Wang, Peng
AU - Zhang, Yanning
AU - Wu, Qi
N1 - Publisher Copyright:
© 2021 IEEE.
PY - 2023
Y1 - 2023
N2 - Referring Expression Comprehension (REC) and Generation (REG) are among the most important tasks in visual reasoning, since they are essential steps for many vision-and-language tasks such as visual question answering and visual dialogue. However, they have not been widely adopted in downstream tasks, mainly for the following reasons: 1) mainstream two-stage methods rely on additional annotations or off-the-shelf detectors to generate proposals, which heavily degrades the generalization ability of models and leads to inevitable error accumulation; 2) although one-stage strategies for REC have been proposed, these methods depend on many hyper-parameters (such as anchors) to generate bounding boxes. In this paper, we present a proposal-free one-stage (PFOS) framework that can directly regress the region of interest from the image or generate unambiguous descriptions in an end-to-end manner. Instead of following the dominant two-stage fashion, we take the dense grid of an image as input to a cross-attention transformer that learns multi-modal correspondences. The final bounding box or sentence is predicted directly from the image without anchor selection or the computation of visual differences. Furthermore, we extend the traditional two-stage listener-speaker framework to joint training under a one-stage learning paradigm. Our model achieves state-of-the-art accuracy and speed for comprehension and competitive results for generation.
KW - One-stage method
KW - Referring expression comprehension
KW - Referring expression generation
UR - http://www.scopus.com/inward/record.url?scp=85124223753&partnerID=8YFLogxK
U2 - 10.1109/TMM.2022.3147385
DO - 10.1109/TMM.2022.3147385
M3 - Article
AN - SCOPUS:85124223753
SN - 1520-9210
VL - 25
SP - 2446
EP - 2458
JO - IEEE Transactions on Multimedia
JF - IEEE Transactions on Multimedia
ER -