Visual Scene-Aware Dialogue System for Cross-Modal Intelligent Human-Machine Interaction

Feiyang Liu, Bin Guo, Hao Wang, Yan Liu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-reviewed

1 citation (Scopus)

Abstract

Adequate perception and understanding of the user’s visual context is an important part of a robot’s ability to interact naturally with humans and achieve true anthropomorphism. In this paper, we focus on the emerging field of visual scene-aware dialogue systems for cross-modal intelligent human-machine interaction, which faces the following challenges: (1) video content undergoes complex dynamic changes in both the temporal and spatial semantic spaces, which makes it difficult to extract accurate visual semantic information; (2) the user’s attention across multiple rounds of dialogue usually involves objects at different spatial positions in different video clips, so the dialogue agent needs fine-grained reasoning capabilities to understand the user’s dialogue context; (3) there is both redundancy and complementarity among multi-modal features, which requires reasonable processing of multi-modal information so that the dialogue agent can gain a comprehensive understanding of the dialogue scene. To address these challenges, this paper proposes a Transformer-based neural network framework that extracts fine-grained visual semantic information through space-to-time and time-to-space bidirectional inference, and a multi-modal fusion method based on a cross-attention framework that allows multi-modal features to interact and fuse fully in a crosswise manner. The experimental results show that, compared with the baseline model, our model improves by 39.5%, 32.1%, 19.7%, and 61.3% on the four metrics BLEU, METEOR, ROUGE-L, and CIDEr, which measure the fluency, accuracy, adequacy, and recall of the generated conversation content, respectively.
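The crosswise interaction the abstract describes — each modality attending over the other via cross-attention, with the attended views then fused — can be sketched as follows. This is a minimal NumPy illustration of the general cross-attention mechanism only, not the paper’s actual architecture; all function names, feature counts, and dimensions are assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context):
    # Scaled dot-product attention in which one modality (queries)
    # attends over another modality (context).
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)  # (n_q, n_ctx) similarity
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ context                   # (n_q, d) attended view

rng = np.random.default_rng(0)
text_feats = rng.normal(size=(4, 64))    # hypothetical dialogue-token features
video_feats = rng.normal(size=(10, 64))  # hypothetical video-clip features

# Crosswise interaction: each modality attends over the other, and one
# simple fusion is to concatenate the original and attended features.
text_over_video = cross_attention(text_feats, video_feats)   # (4, 64)
video_over_text = cross_attention(video_feats, text_feats)   # (10, 64)
fused_text = np.concatenate([text_feats, text_over_video], axis=-1)  # (4, 128)
```

In a full Transformer-style model the queries, keys, and values would additionally pass through learned linear projections and multiple heads; the sketch above keeps only the attention-and-fuse skeleton.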

Original language: English
Title of host publication: Computer Supported Cooperative Work and Social Computing - 17th CCF Conference, ChineseCSCW 2022, Revised Selected Papers
Editors: Yuqing Sun, Tun Lu, Yinzhang Guo, Xiaoxia Song, Hongfei Fan, Dongning Liu, Liping Gao, Bowen Du
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 337-351
Number of pages: 15
ISBN (Print): 9789819923847
DOI
Publication status: Published - 2023
Event: 17th CCF Conference on Computer Supported Cooperative Work and Social Computing, ChineseCSCW 2022 - Taiyuan, China
Duration: 25 Nov 2022 – 27 Nov 2022

Publication series

Name: Communications in Computer and Information Science
Volume: 1682 CCIS
ISSN (Print): 1865-0929
ISSN (Electronic): 1865-0937

Conference

Conference: 17th CCF Conference on Computer Supported Cooperative Work and Social Computing, ChineseCSCW 2022
Country/Territory: China
City: Taiyuan
Period: 25/11/22 – 27/11/22
