Cascade context-oriented spatio-temporal attention network for efficient and fine-grained video-grounded dialogues

Hao Wang, Bin Guo, Mengqi Chen, Qiuyun Zhang, Yasan Ding, Ying Zhang, Zhiwen Yu

Research output: Contribution to journal › Article › peer-review

Abstract

Video-Grounded Dialogue Systems (VGDS), which generate reasonable responses based on multi-turn dialogue contexts and a given video, have received intensive attention recently. The key to building a superior VGDS lies in efficiently reasoning over visual and textual concepts of various granularities and achieving comprehensive visual-textual multi-modal alignment. Despite remarkable research progress, existing studies struggle to identify context-relevant video parts and disregard the impact of redundant information in long-form, content-dynamic videos. Moreover, current methods usually align all semantics in different modalities uniformly with a one-time cross-attention scheme, neglecting the sophisticated correspondence between visual and textual concepts of various granularities (e.g., still objects with nouns, dynamic events with verbs). To this end, we propose a novel system, the Cascade cOntext-oriented Spatio-Temporal Attention Network (COSTA), to generate reasonable responses efficiently and accurately. Specifically, COSTA first adopts a cascade attention network to localize only the most relevant video clips and regions in a coarse-to-fine manner, effectively filtering out irrelevant visual semantics. Second, we design a memory distillation-inspired iterative visual-textual cross-attention strategy that progressively integrates visual semantics with dialogue contexts across varying granularities, facilitating extensive multi-modal alignment. Experiments on several benchmarks demonstrate that our model significantly outperforms state-of-the-art methods across various metrics.
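To make the two mechanisms in the abstract concrete, the PyTorch sketch below illustrates (a) a coarse-to-fine cascade that scores whole clips against the dialogue context, then scores regions only within the surviving clips, and (b) an iterative cross-attention loop in which the dialogue representation repeatedly distills evidence from the selected visual tokens. This is a minimal sketch under assumed interfaces; every module name, feature dimension, and the top-k selection heuristic are illustrative assumptions, not the authors' released implementation of COSTA.

import torch
import torch.nn as nn

class CascadeSpatioTemporalSelector(nn.Module):
    """Coarse-to-fine selection (assumed design): score whole clips against
    the dialogue context, keep the top-k clips, then score regions inside
    the surviving clips and keep the top-k regions."""
    def __init__(self, dim, top_clips=4, top_regions=8):
        super().__init__()
        self.clip_scorer = nn.Linear(dim, 1)
        self.region_scorer = nn.Linear(dim, 1)
        self.top_clips = top_clips
        self.top_regions = top_regions

    def forward(self, clip_feats, region_feats, ctx):
        # clip_feats: (B, C, D); region_feats: (B, C, R, D); ctx: (B, D)
        clip_logits = self.clip_scorer(clip_feats * ctx.unsqueeze(1)).squeeze(-1)  # (B, C)
        top_c = clip_logits.topk(self.top_clips, dim=1).indices                    # (B, k)
        idx = top_c.unsqueeze(-1).unsqueeze(-1).expand(
            -1, -1, region_feats.size(2), region_feats.size(3))
        kept_regions = region_feats.gather(1, idx)                                 # (B, k, R, D)
        flat = kept_regions.flatten(1, 2)                                          # (B, k*R, D)
        region_logits = self.region_scorer(flat * ctx.unsqueeze(1)).squeeze(-1)
        top_r = region_logits.topk(self.top_regions, dim=1).indices
        # Return only the most context-relevant region features.
        return flat.gather(1, top_r.unsqueeze(-1).expand(-1, -1, flat.size(-1)))

class IterativeCrossAttention(nn.Module):
    """Iterative visual-textual cross-attention: the dialogue tokens attend
    over the selected visual tokens for several rounds, folding a little more
    visual evidence into a running memory each time (a rough stand-in for
    the memory-distillation idea named in the abstract)."""
    def __init__(self, dim, num_rounds=3, num_heads=4):
        super().__init__()
        self.rounds = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(num_rounds))
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens, visual_tokens):
        memory = text_tokens                                  # (B, T, D)
        for attn in self.rounds:
            fused, _ = attn(memory, visual_tokens, visual_tokens)
            memory = self.norm(memory + fused)                # residual update per round
        return memory

# Toy usage with random features standing in for real video/dialogue encoders.
B, C, R, T, D = 2, 16, 10, 12, 256
selector = CascadeSpatioTemporalSelector(D)
fuser = IterativeCrossAttention(D)
visual = selector(torch.randn(B, C, D), torch.randn(B, C, R, D), torch.randn(B, D))
out = fuser(torch.randn(B, T, D), visual)  # (B, T, D) fused dialogue representation

The design choice worth noting is that region scoring happens only after clip pruning, so attention cost scales with the few kept clips rather than the full video, which is where the efficiency claim in the abstract would come from.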

Original language: English
Article number: 197329
Journal: Frontiers of Computer Science
Volume: 19
Issue number: 7
DOIs
State: Published - Jul 2025

Keywords

  • multi-modality
  • spatio-temporal attention
  • video-grounded dialogue
