TY - JOUR
T1 - Visual question answering model based on visual relationship detection
AU - Xi, Yuling
AU - Zhang, Yanning
AU - Ding, Songtao
AU - Wan, Shaohua
N1 - Publisher Copyright:
© 2019
PY - 2020/2
Y1 - 2020/2
N2 - Visual question answering (VQA) is a learning task involving two major fields: computer vision and natural language processing. The development of deep learning technology has contributed to the advancement of this research area. Although research on question answering models has made great progress, the low accuracy of VQA models stems mainly from three issues: current question answering model structures are relatively simple, the attention mechanism of the model deviates from human attention, and the model lacks a higher level of logical reasoning ability. In response to the above problems, we propose a VQA model based on multi-objective visual relationship detection. Firstly, appearance features are used to replace the image features of the original objects, and the appearance model is extended by the principle of word vector similarity. The appearance features and relationship predicates are then mapped into the word vector space and represented by fixed-length vectors. Finally, the element-wise concatenation of the image feature and the question vector is fed into the classifier to generate an output answer. Our method is benchmarked on the DAQUAR data set and evaluated by Acc, WUPS@0.0 and WUPS@0.9.
AB - Visual question answering (VQA) is a learning task involving two major fields: computer vision and natural language processing. The development of deep learning technology has contributed to the advancement of this research area. Although research on question answering models has made great progress, the low accuracy of VQA models stems mainly from three issues: current question answering model structures are relatively simple, the attention mechanism of the model deviates from human attention, and the model lacks a higher level of logical reasoning ability. In response to the above problems, we propose a VQA model based on multi-objective visual relationship detection. Firstly, appearance features are used to replace the image features of the original objects, and the appearance model is extended by the principle of word vector similarity. The appearance features and relationship predicates are then mapped into the word vector space and represented by fixed-length vectors. Finally, the element-wise concatenation of the image feature and the question vector is fed into the classifier to generate an output answer. Our method is benchmarked on the DAQUAR data set and evaluated by Acc, WUPS@0.0 and WUPS@0.9.
KW - Appearance features
KW - Relationship predicate
KW - Visual question answering
KW - Word vector similarity
UR - http://www.scopus.com/inward/record.url?scp=85072767554&partnerID=8YFLogxK
U2 - 10.1016/j.image.2019.115648
DO - 10.1016/j.image.2019.115648
M3 - Article
AN - SCOPUS:85072767554
SN - 0923-5965
VL - 80
JO - Signal Processing: Image Communication
JF - Signal Processing: Image Communication
M1 - 115648
ER -