Cascade context-oriented spatio-temporal attention network for efficient and fine-grained video-grounded dialogues

Hao Wang, Bin Guo, Mengqi Chen, Qiuyun Zhang, Yasan Ding, Ying Zhang, Zhiwen Yu

Research output: Contribution to journal, Article, peer-reviewed

Abstract

Video-Grounded Dialogue Systems (VGDS), which generate reasonable responses based on multi-turn dialogue contexts and a given video, have received intensive attention recently. The key to building a superior VGDS lies in efficiently reasoning over visual and textual concepts of various granularities and achieving comprehensive visual-textual multi-modality alignment. Despite remarkable research progress, existing studies struggle to identify context-relevant video parts and disregard the impact of redundant information in long-form, content-dynamic videos. Further, current methods usually align all semantics in different modalities uniformly using a one-time cross-attention scheme, which neglects the sophisticated correspondence between various granularities of visual and textual concepts (e.g., still objects with nouns, dynamic events with verbs). To this end, we propose a novel system, namely Cascade cOntext-oriented Spatio-Temporal Attention Network (COSTA), to generate reasonable responses efficiently and accurately. Specifically, COSTA first adopts a cascade attention network to localize only the most relevant video clips and regions in a coarse-to-fine manner, which effectively filters out irrelevant visual semantics. Second, we design a memory distillation-inspired iterative visual-textual cross-attention strategy to progressively integrate visual semantics with dialogue contexts across varying granularities, facilitating extensive multi-modal alignment. Experiments on several benchmarks demonstrate that our model significantly improves over state-of-the-art methods across various metrics.
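
The coarse-to-fine localization idea mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the COSTA implementation: the abstract gives no architectural details, so the function name `cascade_attention`, the mean-pooled clip scoring, the hard top-k clip selection, and the scaled dot-product scoring are all assumptions chosen for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cascade_attention(context, regions, top_k=2):
    """Hypothetical two-stage (coarse-to-fine) attention.

    context: (d,) dialogue-context vector.
    regions: (T, R, d) features for T clips with R regions each.
    Returns the attended visual vector (d,) and the kept clip indices.
    """
    d = context.shape[0]
    # Coarse stage (assumption): score each clip by its mean-pooled regions.
    clip_feats = regions.mean(axis=1)                    # (T, d)
    clip_scores = clip_feats @ context / np.sqrt(d)      # (T,)
    # Keep only the top_k clips, restored to temporal order.
    kept = np.sort(np.argsort(clip_scores)[-top_k:])
    # Fine stage: soft attention over regions of the kept clips only.
    fine = regions[kept].reshape(-1, d)                  # (top_k * R, d)
    weights = softmax(fine @ context / np.sqrt(d))       # (top_k * R,)
    attended = weights @ fine                            # (d,)
    return attended, kept

# Toy usage: 5 clips, 3 regions per clip, 4-dimensional features.
context = np.array([1.0, 0.0, 0.0, 0.0])
regions = np.zeros((5, 3, 4))
regions[1, :, 0] = 5.0   # clip 1 matches the context strongly
regions[3, :, 0] = 4.0   # clip 3 matches slightly less
attended, kept = cascade_attention(context, regions, top_k=2)
```

In this toy setup only clips 1 and 3 survive the coarse stage, so the fine stage never touches the regions of the discarded clips. That is the sense in which a cascade can filter redundant visual content before the more expensive fine-grained attention.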

Original language: English
Article number: 197329
Journal: Frontiers of Computer Science
Volume: 19
Issue number: 7
Publication status: Published - Jul 2025
