Visual Scene-Aware Dialogue System for Cross-Modal Intelligent Human-Machine Interaction

Feiyang Liu, Bin Guo, Hao Wang, Yan Liu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Scopus citation

Abstract

Adequate perception and understanding of the user’s visual context is an important part of a robot’s ability to interact naturally with humans and achieve true anthropomorphism. In this paper, we focus on the emerging field of visual scene-aware dialogue systems for cross-modal intelligent human-machine interaction, which faces the following challenges: (1) Video content undergoes complex dynamic changes in both the temporal and spatial semantic space, which makes it difficult to extract accurate visual semantic information; (2) The user’s attention over multiple rounds of dialogue usually involves objects at different spatial positions in different video clips, which requires the dialogue agent to have fine-grained reasoning capabilities to understand the dialogue context; (3) Multi-modal features are both redundant and complementary, so multi-modal information must be processed sensibly for the dialogue agent to gain a comprehensive understanding of the dialogue scene. To address these challenges, this paper proposes a Transformer-based neural network framework that extracts fine-grained visual semantic information through bidirectional space-to-time and time-to-space inference, together with a multi-modal fusion method based on a cross-attention framework that allows multi-modal features to fully interact and fuse with one another. Experimental results show that, compared with the baseline model, the proposed model improves by 39.5%, 32.1%, 19.7%, and 61.3% on the four metrics BLEU, METEOR, ROUGE-L, and CIDEr, which measure the fluency, accuracy, adequacy, and recall of the generated dialogue content, respectively.
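To make the cross-attention fusion idea described in the abstract concrete, the sketch below shows a minimal bidirectional cross-attention block in PyTorch: dialogue-text features attend over video features and vice versa, and the two fused streams are concatenated as joint context for a response decoder. This is an illustrative assumption based on standard cross-attention practice, not the paper's actual implementation; all class and variable names (CrossModalFusion, text_feats, video_feats) are hypothetical.

import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Illustrative cross-attention fusion of video and dialogue features.

    Two cross-attention passes: text queries attend over video keys/values,
    and video queries attend over text keys/values, so the modalities
    interact in a "cross" manner before being concatenated.
    """
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # batch_first=True: inputs are shaped (batch, seq_len, d_model)
        self.text_to_video = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.video_to_text = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_t = nn.LayerNorm(d_model)
        self.norm_v = nn.LayerNorm(d_model)

    def forward(self, text_feats: torch.Tensor, video_feats: torch.Tensor) -> torch.Tensor:
        # Text attends to video: which frames/regions the user's question refers to.
        t_attn, _ = self.text_to_video(text_feats, video_feats, video_feats)
        # Video attends to text: which dialogue tokens ground each video clip.
        v_attn, _ = self.video_to_text(video_feats, text_feats, text_feats)
        # Residual connections keep each modality's original information.
        fused_text = self.norm_t(text_feats + t_attn)
        fused_video = self.norm_v(video_feats + v_attn)
        # Concatenate along the sequence axis as joint context for a decoder.
        return torch.cat([fused_text, fused_video], dim=1)

# Usage example: 20 dialogue tokens and 32 video-clip features, both projected to 512-d.
fusion = CrossModalFusion()
text = torch.randn(2, 20, 512)
video = torch.randn(2, 32, 512)
print(fusion(text, video).shape)  # torch.Size([2, 52, 512])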

Original language: English
Title of host publication: Computer Supported Cooperative Work and Social Computing - 17th CCF Conference, ChineseCSCW 2022, Revised Selected Papers
Editors: Yuqing Sun, Tun Lu, Yinzhang Guo, Xiaoxia Song, Hongfei Fan, Dongning Liu, Liping Gao, Bowen Du
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 337-351
Number of pages: 15
ISBN (Print): 9789819923847
DOIs
State: Published - 2023
Event: 17th CCF Conference on Computer Supported Cooperative Work and Social Computing, ChineseCSCW 2022 - Taiyuan, China
Duration: 25 Nov 2022 – 27 Nov 2022

Publication series

Name: Communications in Computer and Information Science
Volume: 1682 CCIS
ISSN (Print): 1865-0929
ISSN (Electronic): 1865-0937

Conference

Conference: 17th CCF Conference on Computer Supported Cooperative Work and Social Computing, ChineseCSCW 2022
Country/Territory: China
City: Taiyuan
Period: 25/11/22 – 27/11/22

Keywords

  • Cross-attention mechanism
  • Human-machine dialogue
  • Human-machine interaction
  • Scene awareness
  • Spatial-temporal reasoning
