Visual Scene-Aware Dialogue System for Cross-Modal Intelligent Human-Machine Interaction

Feiyang Liu, Bin Guo, Hao Wang, Yan Liu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-reviewed

1 citation (Scopus)

Abstract

Adequate perception and understanding of the user’s visual context is an important part of a robot’s ability to interact naturally with humans and achieve true anthropomorphism. In this paper, we focus on the emerging field of visual scene-aware dialogue systems for cross-modal intelligent human-machine interaction, which faces the following challenges: (1) video content undergoes complex dynamic changes in both the temporal and spatial semantic spaces, which makes it difficult to extract accurate visual semantic information; (2) the user’s attention across multiple rounds of dialogue usually involves objects at different spatial positions in different video clips, so the dialogue agent needs fine-grained reasoning capabilities to understand the user’s dialogue context; (3) there is both redundancy and complementarity among multi-modal features, which requires reasonable processing of multi-modal information so that the dialogue agent can gain a comprehensive understanding of the dialogue scene. To address these challenges, this paper proposes a Transformer-based neural network framework that extracts fine-grained visual semantic information through space-to-time and time-to-space bidirectional inference, and a multi-modal fusion method based on a cross-attention framework that allows multi-modal features to interact and fuse fully in a crosswise manner. The experimental results show that, compared with the baseline model, our model improves by 39.5%, 32.1%, 19.7%, and 61.3% on the four metrics BLEU, METEOR, ROUGE-L, and CIDEr, which measure the fluency, accuracy, adequacy, and recall of the generated conversation content, respectively.
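The crosswise interaction the abstract describes — each modality attending over the other via cross-attention, with the attended views then fused — can be sketched as follows. This is a minimal NumPy illustration of the general cross-attention mechanism only, not the paper’s actual architecture; all function names, feature counts, and dimensions are assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context):
    # Scaled dot-product attention in which one modality (queries)
    # attends over another modality (context).
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)  # (n_q, n_ctx) similarity
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ context                   # (n_q, d) attended view

rng = np.random.default_rng(0)
text_feats = rng.normal(size=(4, 64))    # hypothetical dialogue-token features
video_feats = rng.normal(size=(10, 64))  # hypothetical video-clip features

# Crosswise interaction: each modality attends over the other, and one
# simple fusion is to concatenate the original and attended features.
text_over_video = cross_attention(text_feats, video_feats)   # (4, 64)
video_over_text = cross_attention(video_feats, text_feats)   # (10, 64)
fused_text = np.concatenate([text_feats, text_over_video], axis=-1)  # (4, 128)
```

In a full Transformer-style model the queries, keys, and values would additionally pass through learned linear projections and multiple heads; the sketch above keeps only the attention-and-fuse skeleton.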

Original language: English
Title of host publication: Computer Supported Cooperative Work and Social Computing - 17th CCF Conference, ChineseCSCW 2022, Revised Selected Papers
Editors: Yuqing Sun, Tun Lu, Yinzhang Guo, Xiaoxia Song, Hongfei Fan, Dongning Liu, Liping Gao, Bowen Du
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 337-351
Number of pages: 15
ISBN (Print): 9789819923847
DOI
Publication status: Published - 2023
Event: 17th CCF Conference on Computer Supported Cooperative Work and Social Computing, ChineseCSCW 2022 - Taiyuan, China
Duration: 25 Nov 2022 – 27 Nov 2022

Publication series

Name: Communications in Computer and Information Science
Volume: 1682 CCIS
ISSN (Print): 1865-0929
ISSN (Electronic): 1865-0937

Conference

Conference: 17th CCF Conference on Computer Supported Cooperative Work and Social Computing, ChineseCSCW 2022
Country/Territory: China
City: Taiyuan
Period: 25/11/22 – 27/11/22
