TY - GEN
T1 - Visual Scene-Aware Dialogue System for Cross-Modal Intelligent Human-Machine Interaction
AU - Liu, Feiyang
AU - Guo, Bin
AU - Wang, Hao
AU - Liu, Yan
N1 - Publisher Copyright:
© 2023, The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
PY - 2023
Y1 - 2023
AB - Adequate perception and understanding of the user’s visual context is an important part of a robot’s ability to interact naturally with humans and achieve true anthropomorphism. In this paper, we focus on the emerging field of visual scene-aware dialogue systems for cross-modal intelligent human-machine interaction, which faces the following challenges: (1) video content undergoes complex dynamic changes in both the temporal and spatial semantic spaces, making it difficult to extract accurate visual semantic information; (2) the user’s attention across multiple rounds of dialogue usually involves objects at different spatial positions in different video clips, so the dialogue agent needs fine-grained reasoning capabilities to understand the user’s dialogue context; (3) multi-modal features exhibit both information redundancy and complementarity, requiring careful processing of multi-modal information so that the dialogue agent can gain a comprehensive understanding of the dialogue scene. To address these challenges, this paper proposes a Transformer-based neural network framework that extracts fine-grained visual semantic information through space-to-time and time-to-space bidirectional inference, and a multi-modal fusion method based on a cross-attention framework that allows multi-modal features to interact and fuse fully in a crosswise manner. Experimental results show that, compared with the baseline model, the proposed model improves by 39.5%, 32.1%, 19.7%, and 61.3% on the four metrics BLEU, METEOR, ROUGE-L, and CIDEr, which measure the fluency, accuracy, adequacy, and recall of the generated conversation content, respectively.
KW - Cross-attention mechanism
KW - Human-machine dialogue
KW - Human-machine interaction
KW - Scene awareness
KW - Spatial-temporal reasoning
UR - http://www.scopus.com/inward/record.url?scp=85161132515&partnerID=8YFLogxK
U2 - 10.1007/978-981-99-2385-4_25
DO - 10.1007/978-981-99-2385-4_25
M3 - Conference contribution
AN - SCOPUS:85161132515
SN - 9789819923847
T3 - Communications in Computer and Information Science
SP - 337
EP - 351
BT - Computer Supported Cooperative Work and Social Computing - 17th CCF Conference, ChineseCSCW 2022, Revised Selected Papers
A2 - Sun, Yuqing
A2 - Lu, Tun
A2 - Guo, Yinzhang
A2 - Song, Xiaoxia
A2 - Fan, Hongfei
A2 - Liu, Dongning
A2 - Gao, Liping
A2 - Du, Bowen
PB - Springer Science and Business Media Deutschland GmbH
T2 - 17th CCF Conference on Computer Supported Cooperative Work and Social Computing, ChineseCSCW 2022
Y2 - 25 November 2022 through 27 November 2022
ER -