TY - JOUR
T1 - Center-enhanced video captioning model with multimodal semantic alignment
AU - Zhang, Benhui
AU - Gao, Junyu
AU - Yuan, Yuan
N1 - Publisher Copyright:
© 2024 Elsevier Ltd
PY - 2024/12
Y1 - 2024/12
N2 - Video captioning aims to automatically generate descriptive sentences for a given video, establishing an association between visual content and textual language; it has attracted great attention and plays a significant role in many practical applications. Previous research focuses mainly on caption generation, ignoring the alignment of multimodal features and simply concatenating them. Moreover, video feature extraction is usually performed offline, so the extracted features may not be adapted to the subsequent caption generation task. To improve the applicability of extracted features for downstream caption generation and to address the issue of multimodal semantic alignment and fusion, we propose an end-to-end center-enhanced video captioning model with multimodal semantic alignment, which integrates feature extraction and caption generation into a unified framework. To enhance the completeness of semantic features, we design a center enhancement strategy in which visual–textual deep joint semantic features are captured via incremental clustering; the cluster centers then serve as guidance for better caption generation. Furthermore, we promote visual–textual multimodal alignment and fusion by learning visual and textual representations in a shared latent semantic space, so as to alleviate the multimodal misalignment problem. Experimental results on two popular datasets, MSVD and MSR-VTT, demonstrate that the proposed model outperforms state-of-the-art methods, producing higher-quality captions.
AB - Video captioning aims to automatically generate descriptive sentences for a given video, establishing an association between visual content and textual language; it has attracted great attention and plays a significant role in many practical applications. Previous research focuses mainly on caption generation, ignoring the alignment of multimodal features and simply concatenating them. Moreover, video feature extraction is usually performed offline, so the extracted features may not be adapted to the subsequent caption generation task. To improve the applicability of extracted features for downstream caption generation and to address the issue of multimodal semantic alignment and fusion, we propose an end-to-end center-enhanced video captioning model with multimodal semantic alignment, which integrates feature extraction and caption generation into a unified framework. To enhance the completeness of semantic features, we design a center enhancement strategy in which visual–textual deep joint semantic features are captured via incremental clustering; the cluster centers then serve as guidance for better caption generation. Furthermore, we promote visual–textual multimodal alignment and fusion by learning visual and textual representations in a shared latent semantic space, so as to alleviate the multimodal misalignment problem. Experimental results on two popular datasets, MSVD and MSR-VTT, demonstrate that the proposed model outperforms state-of-the-art methods, producing higher-quality captions.
KW - Center enhancement
KW - Multimodal semantic alignment
KW - Video captioning
UR - http://www.scopus.com/inward/record.url?scp=85204805697&partnerID=8YFLogxK
U2 - 10.1016/j.neunet.2024.106744
DO - 10.1016/j.neunet.2024.106744
M3 - Article
C2 - 39326191
AN - SCOPUS:85204805697
SN - 0893-6080
VL - 180
JO - Neural Networks
JF - Neural Networks
M1 - 106744
ER -