Cascade context-oriented spatio-temporal attention network for efficient and fine-grained video-grounded dialogues

Hao Wang; Bin Guo; Mengqi Chen; Qiuyun Zhang; Yasan Ding; Ying Zhang; Zhiwen Yu

doi:10.1007/s11704-024-40387-w

School of Computer Science

Northwestern Polytechnical University, Xi'an

Research output: Contribution to journal › Article › Peer-reviewed

Abstract

Video-Grounded Dialogue System (VGDS), focusing on generating reasonable responses based on multi-turn dialogue contexts and a given video, has received intensive attention recently. The key to building a superior VGDS lies in efficiently reasoning over visual and textual concepts of various granularities and achieving comprehensive visual-textual multi-modality alignment. Despite remarkable research progress, existing studies struggle to identify context-relevant video parts and disregard the impact of redundant information in long-form and content-dynamic videos. Further, current methods usually align all semantics in different modalities uniformly using a one-time cross-attention scheme, which neglects the sophisticated correspondence between various granularities of visual and textual concepts (e.g., still objects with nouns, dynamic events with verbs). To this end, we propose a novel system, namely Cascade cOntext-oriented Spatio-Temporal Attention Network (COSTA), to generate reasonable responses efficiently and accurately. Specifically, COSTA first adopts a cascade attention network to localize only the most relevant video clips and regions in a coarse-to-fine manner, which effectively filters out irrelevant visual semantics. Secondly, we design a memory distillation-inspired iterative visual-textual cross-attention strategy to progressively integrate visual semantics with dialogue contexts across varying granularities, facilitating extensive multi-modal alignment. Experiments on several benchmarks demonstrate significant improvements in our model over state-of-the-art methods across various metrics.

Original language	English
Article number	197329
Journal	Frontiers of Computer Science
Volume	19
Issue	7
DOI	https://doi.org/10.1007/s11704-024-40387-w
Publication status	Published - Jul 2025

Cite this

@article{5afc6e77b8ea41d68c2212c45ca7a0af,

title = "Cascade context-oriented spatio-temporal attention network for efficient and fine-grained video-grounded dialogues",

abstract = "Video-Grounded Dialogue System (VGDS), focusing on generating reasonable responses based on multi-turn dialogue contexts and a given video, has received intensive attention recently. The key to building a superior VGDS lies in efficiently reasoning over visual and textual concepts of various granularities and achieving comprehensive visual-textual multi-modality alignment. Despite remarkable research progress, existing studies suffer from identifying context-relevant video parts while disregarding the impact of redundant information in long-form and content-dynamic videos. Further, current methods usually align all semantics in different modalities uniformly using a one-time cross-attention scheme, which neglects the sophisticated correspondence between various granularities of visual and textual concepts (e.g., still objects with nouns, dynamic events with verbs). To this end, we propose a novel system, namely Cascade cOntext-oriented Spatio-Temporal Attention Network (COSTA), to generate reasonable responses efficiently and accurately. Specifically, COSTA first adopts a cascade attention network to localize only the most relevant video clips and regions in a coarse-to-fine manner which effectively filters the irrelevant visual semantics. Secondly, we design a memory distillation-inspired iterative visual-textual cross-attention strategy to progressively integrate visual semantics with dialogue contexts across varying granularities, facilitating extensive multi-modal alignment. Experiments on several benchmarks demonstrate significant improvements in our model over state-of-the-art methods across various metrics.",

keywords = "multi-modality, spatio-temporal attention, video-grounded dialogue",

author = "Hao Wang and Bin Guo and Mengqi Chen and Qiuyun Zhang and Yasan Ding and Ying Zhang and Zhiwen Yu",

note = "Publisher Copyright: {\textcopyright} Higher Education Press 2025.",

year = "2025",

month = jul,

doi = "10.1007/s11704-024-40387-w",

language = "English",

volume = "19",

journal = "Frontiers of Computer Science",

issn = "2095-2228",

publisher = "Higher Education Press Limited Company",

number = "7",

}

TY - JOUR

T1 - Cascade context-oriented spatio-temporal attention network for efficient and fine-grained video-grounded dialogues

AU - Wang, Hao

AU - Guo, Bin

AU - Chen, Mengqi

AU - Zhang, Qiuyun

AU - Ding, Yasan

AU - Zhang, Ying

AU - Yu, Zhiwen

N1 - Publisher Copyright: © Higher Education Press 2025.

PY - 2025/7

Y1 - 2025/7

N2 - Video-Grounded Dialogue System (VGDS), focusing on generating reasonable responses based on multi-turn dialogue contexts and a given video, has received intensive attention recently. The key to building a superior VGDS lies in efficiently reasoning over visual and textual concepts of various granularities and achieving comprehensive visual-textual multi-modality alignment. Despite remarkable research progress, existing studies suffer from identifying context-relevant video parts while disregarding the impact of redundant information in long-form and content-dynamic videos. Further, current methods usually align all semantics in different modalities uniformly using a one-time cross-attention scheme, which neglects the sophisticated correspondence between various granularities of visual and textual concepts (e.g., still objects with nouns, dynamic events with verbs). To this end, we propose a novel system, namely Cascade cOntext-oriented Spatio-Temporal Attention Network (COSTA), to generate reasonable responses efficiently and accurately. Specifically, COSTA first adopts a cascade attention network to localize only the most relevant video clips and regions in a coarse-to-fine manner which effectively filters the irrelevant visual semantics. Secondly, we design a memory distillation-inspired iterative visual-textual cross-attention strategy to progressively integrate visual semantics with dialogue contexts across varying granularities, facilitating extensive multi-modal alignment. Experiments on several benchmarks demonstrate significant improvements in our model over state-of-the-art methods across various metrics.

AB - Video-Grounded Dialogue System (VGDS), focusing on generating reasonable responses based on multi-turn dialogue contexts and a given video, has received intensive attention recently. The key to building a superior VGDS lies in efficiently reasoning over visual and textual concepts of various granularities and achieving comprehensive visual-textual multi-modality alignment. Despite remarkable research progress, existing studies suffer from identifying context-relevant video parts while disregarding the impact of redundant information in long-form and content-dynamic videos. Further, current methods usually align all semantics in different modalities uniformly using a one-time cross-attention scheme, which neglects the sophisticated correspondence between various granularities of visual and textual concepts (e.g., still objects with nouns, dynamic events with verbs). To this end, we propose a novel system, namely Cascade cOntext-oriented Spatio-Temporal Attention Network (COSTA), to generate reasonable responses efficiently and accurately. Specifically, COSTA first adopts a cascade attention network to localize only the most relevant video clips and regions in a coarse-to-fine manner which effectively filters the irrelevant visual semantics. Secondly, we design a memory distillation-inspired iterative visual-textual cross-attention strategy to progressively integrate visual semantics with dialogue contexts across varying granularities, facilitating extensive multi-modal alignment. Experiments on several benchmarks demonstrate significant improvements in our model over state-of-the-art methods across various metrics.

KW - multi-modality

KW - spatio-temporal attention

KW - video-grounded dialogue

UR - http://www.scopus.com/inward/record.url?scp=85211919114&partnerID=8YFLogxK

U2 - 10.1007/s11704-024-40387-w

DO - 10.1007/s11704-024-40387-w

M3 - Article

AN - SCOPUS:85211919114

SN - 2095-2228

VL - 19

JO - Frontiers of Computer Science

JF - Frontiers of Computer Science

IS - 7

M1 - 197329

ER -
