TY - JOUR
T1 - Visual question answering model based on visual relationship detection
AU - Xi, Yuling
AU - Zhang, Yanning
AU - Ding, Songtao
AU - Wan, Shaohua
N1 - Publisher Copyright:
© 2019
PY - 2020/2
Y1 - 2020/2
N2 - Visual question answering (VQA) is a learning task involving two major fields: computer vision and natural language processing. The development of deep learning technology has contributed to the advancement of this research area. Although research on question answering models has made great progress, the low accuracy of VQA models stems mainly from three issues: current question answering model structures are relatively simple, the attention mechanism of the model deviates from human attention, and the model lacks a higher level of logical reasoning ability. In response to the above problems, we propose a VQA model based on multi-objective visual relationship detection. Firstly, appearance features are used to replace the image features of the original objects, and the appearance model is extended by the principle of word vector similarity. The appearance features and relationship predicates are then mapped into the word vector space and represented by fixed-length vectors. Finally, the element-wise concatenation of the image feature and the question vector is fed into the classifier to generate an output answer. Our method is benchmarked on the DAQUAR data set and evaluated by Acc, WUPS@0.0 and WUPS@0.9.
AB - Visual question answering (VQA) is a learning task involving two major fields: computer vision and natural language processing. The development of deep learning technology has contributed to the advancement of this research area. Although research on question answering models has made great progress, the low accuracy of VQA models stems mainly from three issues: current question answering model structures are relatively simple, the attention mechanism of the model deviates from human attention, and the model lacks a higher level of logical reasoning ability. In response to the above problems, we propose a VQA model based on multi-objective visual relationship detection. Firstly, appearance features are used to replace the image features of the original objects, and the appearance model is extended by the principle of word vector similarity. The appearance features and relationship predicates are then mapped into the word vector space and represented by fixed-length vectors. Finally, the element-wise concatenation of the image feature and the question vector is fed into the classifier to generate an output answer. Our method is benchmarked on the DAQUAR data set and evaluated by Acc, WUPS@0.0 and WUPS@0.9.
KW - Appearance features
KW - Relationship predicate
KW - Visual question answering
KW - Word vector similarity
UR - http://www.scopus.com/inward/record.url?scp=85072767554&partnerID=8YFLogxK
U2 - 10.1016/j.image.2019.115648
DO - 10.1016/j.image.2019.115648
M3 - Article
AN - SCOPUS:85072767554
SN - 0923-5965
VL - 80
JO - Signal Processing: Image Communication
JF - Signal Processing: Image Communication
M1 - 115648
ER -