TY - GEN
T1 - Multimodal Graph Conditioned Diffusion Model for Video Captioning
AU - Zhang, Benhui
AU - Gao, Junyu
AU - Yuan, Yuan
N1 - Publisher Copyright:
© 2026 Owner/Author.
PY - 2026/4/12
Y1 - 2026/4/12
N2 - Video captioning aims to describe the content of a given video in concise natural-language sentences. The task is challenging because it demands both strong visual-textual relevance and an understanding of multimodal fusion. Previous works focus primarily on visual content modeling and often overlook the rich semantic correlations between the visual and textual modalities, which results in an incomplete understanding of the multimodal context and suboptimal caption accuracy. In this paper, we propose a multimodal graph conditioned diffusion model for video captioning, named MGCDVc. The idea behind our model is to combine graph-based relational reasoning with diffusion-based generative modeling to jointly model cross-modal relationships and capture latent semantic structure. Specifically, we learn a set of latent concept anchors that bridge the visual and textual modality nodes, enabling the construction of a weighted multimodal graph. We then introduce a graph conditioned diffusion strategy that generates the textual semantic nodes and their associated edges conditioned on the graph structure. Furthermore, a soft pruning mechanism is designed to filter out low-quality nodes, further refining the generated multimodal graph so that it provides more accurate semantic structural guidance for caption generation. Experimental results on several popular datasets demonstrate that our model achieves better performance on the video captioning task.
KW - graph neural network
KW - language generation
KW - multimodal learning
UR - https://www.scopus.com/pages/publications/105038530996
U2 - 10.1145/3774904.3792087
DO - 10.1145/3774904.3792087
M3 - Conference contribution
AN - SCOPUS:105038530996
T3 - WWW 2026 - Proceedings of the ACM Web Conference 2026
SP - 3566
EP - 3575
BT - WWW 2026 - Proceedings of the ACM Web Conference 2026
PB - Association for Computing Machinery, Inc
T2 - 35th ACM Web Conference, WWW 2026
Y2 - 29 June 2026 through 3 July 2026
ER -