Multimodal Graph Conditioned Diffusion Model for Video Captioning

  • Northwestern Polytechnical University Xian

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-reviewed

Abstract

Video captioning aims to describe the content of a given video with condensed natural language sentences. The task is challenging because it places high demands on visual-textual relevance and on multimodal fusion understanding. Previous works primarily focus on visual content modeling and often overlook the rich semantic correlations between the visual and textual modalities, which results in an incomplete understanding of the multimodal context and suboptimal caption accuracy. In this paper, we propose a multimodal graph conditioned diffusion model for video captioning, named MGCDVc. The idea behind our model is to combine graph-based relational reasoning with diffusion-based generative modeling to jointly model cross-modal relationships and capture latent semantic structure. Specifically, we learn a set of latent concept anchors to bridge the visual and textual modality nodes, enabling the construction of a weighted multimodal graph. We then introduce a graph conditioned diffusion strategy that generates the textual semantic nodes and their associated edges under a graph-structure-aware condition. Furthermore, a soft pruning mechanism is designed to filter out low-quality nodes, further refining the generated multimodal graph so that it provides more accurate semantic structural guidance for caption generation. Experimental results on several popular datasets demonstrate that our model achieves better performance on the video captioning task.

Original language: English
Title of host publication: WWW 2026 - Proceedings of the ACM Web Conference 2026
Publisher: Association for Computing Machinery, Inc
Pages: 3566-3575
Number of pages: 10
ISBN (electronic): 9798400723070
DOI
Publication status: Published - 12 Apr 2026
Event: 35th ACM Web Conference, WWW 2026 - Dubai, United Arab Emirates
Duration: 29 Jun 2026 → 3 Jul 2026

Publication series

Name: WWW 2026 - Proceedings of the ACM Web Conference 2026

Conference

Conference: 35th ACM Web Conference, WWW 2026
Country/Territory: United Arab Emirates
City: Dubai
Period: 29/06/26 → 3/07/26
