TY - GEN
T1 - Visual Scene-Aware Dialogue System for Cross-Modal Intelligent Human-Machine Interaction
AU - Liu, Feiyang
AU - Guo, Bin
AU - Wang, Hao
AU - Liu, Yan
N1 - Publisher Copyright:
© 2023, The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
PY - 2023
Y1 - 2023
AB - Adequate perception and understanding of the user’s visual context is an important part of a robot’s ability to interact naturally with humans and achieve true anthropomorphism. In this paper, we focus on the emerging field of visual scene-aware dialogue systems for cross-modal intelligent human-machine interaction, which faces the following challenges: (1) video content undergoes complex dynamic changes in both the temporal and spatial semantic spaces, making it difficult to extract accurate visual semantic information; (2) the user’s attention across multiple rounds of dialogue usually involves objects at different spatial positions in different video clips, so the dialogue agent needs fine-grained reasoning capabilities to understand the user’s dialogue context; (3) multi-modal features exhibit both information redundancy and complementarity, requiring careful processing of multi-modal information so that the dialogue agent can gain a comprehensive understanding of the dialogue scene. To address these challenges, this paper proposes a Transformer-based neural network framework that extracts fine-grained visual semantic information through space-to-time and time-to-space bidirectional inference, and a multi-modal fusion method based on a cross-attention framework that allows multi-modal features to interact and fuse fully in a crosswise manner. Experimental results show that, compared with the baseline model, the proposed model improves by 39.5%, 32.1%, 19.7%, and 61.3% on the four metrics BLEU, METEOR, ROUGE-L, and CIDEr, which measure the fluency, accuracy, adequacy, and recall of the generated conversation content, respectively.
KW - Cross-attention mechanism
KW - Human-machine dialogue
KW - Human-machine interaction
KW - Scene awareness
KW - Spatial-temporal reasoning
UR - http://www.scopus.com/inward/record.url?scp=85161132515&partnerID=8YFLogxK
U2 - 10.1007/978-981-99-2385-4_25
DO - 10.1007/978-981-99-2385-4_25
M3 - Conference contribution
AN - SCOPUS:85161132515
SN - 9789819923847
T3 - Communications in Computer and Information Science
SP - 337
EP - 351
BT - Computer Supported Cooperative Work and Social Computing - 17th CCF Conference, ChineseCSCW 2022, Revised Selected Papers
A2 - Sun, Yuqing
A2 - Lu, Tun
A2 - Guo, Yinzhang
A2 - Song, Xiaoxia
A2 - Fan, Hongfei
A2 - Liu, Dongning
A2 - Gao, Liping
A2 - Du, Bowen
PB - Springer Science and Business Media Deutschland GmbH
T2 - 17th CCF Conference on Computer Supported Cooperative Work and Social Computing, ChineseCSCW 2022
Y2 - 25 November 2022 through 27 November 2022
ER -