Center-enhanced video captioning model with multimodal semantic alignment

Benhui Zhang, Junyu Gao, Yuan Yuan

Research output: Contribution to journal › Article › peer-review

3 Citations (Scopus)

Abstract

Video captioning aims to automatically generate descriptive sentences for a given video, establishing an association between visual content and textual language; it has attracted great attention and plays a significant role in many practical applications. Previous research focuses more on the caption generation stage, ignoring the alignment of multimodal features and simply concatenating them. Moreover, video feature extraction is usually performed off-line, so the extracted features may not be adapted to the subsequent caption generation task. To improve the applicability of extracted features for downstream caption generation and to address the issue of multimodal semantic alignment and fusion, we propose an end-to-end center-enhanced video captioning model with multimodal semantic alignment, which integrates feature extraction and caption generation into a unified framework. To enhance the completeness of semantic features, we design a center enhancement strategy in which deep joint visual–textual semantic features are captured via incremental clustering, and the resulting cluster centers serve as guidance for better caption generation. In addition, we promote visual–textual multimodal alignment and fusion by learning visual and textual representations in a shared latent semantic space, alleviating the multimodal misalignment problem. Experimental results on two popular datasets, MSVD and MSR-VTT, demonstrate that the proposed model outperforms state-of-the-art methods, producing higher-quality captions.
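
The two mechanisms the abstract describes lend themselves to a short illustration: projecting visual and textual features into a shared latent space with a contrastive-style alignment objective, and maintaining cluster centers over the joint features via incremental updates. The sketch below is not the authors' implementation; the module name SharedSpaceAligner, all dimensions, the temperature, and the exponential-moving-average update rule are illustrative assumptions.

```python
# A minimal sketch (assumed, not the paper's code) of shared-latent-space
# alignment plus incrementally updated cluster centers.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedSpaceAligner(nn.Module):
    def __init__(self, vis_dim=2048, txt_dim=768, latent_dim=512, n_centers=64):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, latent_dim)  # visual -> shared space
        self.txt_proj = nn.Linear(txt_dim, latent_dim)  # textual -> shared space
        # Cluster centers live in the shared space and are updated
        # incrementally (EMA), not by gradient descent.
        self.register_buffer(
            "centers", F.normalize(torch.randn(n_centers, latent_dim), dim=-1)
        )

    def forward(self, vis_feat, txt_feat):
        v = F.normalize(self.vis_proj(vis_feat), dim=-1)  # (B, D)
        t = F.normalize(self.txt_proj(txt_feat), dim=-1)  # (B, D)
        # Symmetric InfoNCE-style loss pulling matched video-caption
        # pairs together in the shared space.
        logits = v @ t.t() / 0.07
        labels = torch.arange(v.size(0), device=v.device)
        align_loss = 0.5 * (F.cross_entropy(logits, labels)
                            + F.cross_entropy(logits.t(), labels))
        return v, t, align_loss

    @torch.no_grad()
    def update_centers(self, joint_feat, momentum=0.99):
        # Incremental clustering: assign each joint feature to its nearest
        # center, then pull that center toward the batch mean with an EMA step.
        assign = (joint_feat @ self.centers.t()).argmax(dim=-1)
        for k in assign.unique():
            mean_k = joint_feat[assign == k].mean(dim=0)
            self.centers[k] = F.normalize(
                momentum * self.centers[k] + (1 - momentum) * mean_k, dim=-1
            )

# Illustrative usage with random tensors standing in for extractor outputs.
aligner = SharedSpaceAligner()
vis = torch.randn(8, 2048)  # e.g., pooled video features
txt = torch.randn(8, 768)   # e.g., pooled caption features
v, t, loss = aligner(vis, txt)
aligner.update_centers(F.normalize(0.5 * (v + t), dim=-1))
```

In a full model, the center nearest to each sample (or an attention-weighted mix of centers) could be fed to the caption decoder as additional guidance, which is the role the abstract assigns to the cluster centers.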

Original language: English
Article number: 106744
Journal: Neural Networks
Volume: 180
DOI
Publication status: Published - Dec 2024
