Style-aware two-stage learning framework for video captioning

Yunchuan Ma, Zheng Zhu, Yuankai Qi, Amin Beheshti, Ying Li, Laiyun Qing, Guorong Li

Research output: Contribution to journal › Article › peer-review

5 Citations (Scopus)

Abstract

Significant progress has been made in video captioning in recent years. However, most existing methods learn directly from all given captions without distinguishing their styles. The large diversity among these captions can introduce ambiguity into model learning. To address this issue, we propose a style-aware two-stage learning framework. In the first stage, the model is trained with captions of separate styles, including length style (short, medium, or long), action style (single or multiple actions), and object style (one object or more). For efficiency, a shared model with multiple individual style vectors is learned. In the second stage, a video style encoder is devised to capture style information from the input video and output a guidance signal on how to utilize the style vectors for final caption generation. Without bells and whistles, our method achieves state-of-the-art performance on three widely used public datasets: MSVD, MSR-VTT, and VATEX. The source code and trained models will be made publicly available.
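Since the abstract outlines a concrete architecture, a minimal PyTorch sketch may help make the two stages concrete. Everything below (module names, dimensions, the GRU decoder, the choice of seven style vectors) is an illustrative assumption, not the paper's actual model: it shows only the core idea of a shared captioner conditioned on per-style vectors in stage 1, with a video style encoder predicting how to mix those vectors in stage 2.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StyleAwareCaptioner(nn.Module):
    """Illustrative sketch: a shared decoder plus per-style vectors
    (stage 1) and a video style encoder that predicts mixing weights
    over those vectors (stage 2). All names and sizes are assumptions."""

    def __init__(self, feat_dim=1024, hid_dim=512, vocab_size=10000,
                 num_styles=7):
        super().__init__()
        # One learnable vector per caption style, e.g. length
        # (short/medium/long), action (single/multiple), object (one/more).
        self.style_vectors = nn.Parameter(torch.randn(num_styles, hid_dim))
        self.video_proj = nn.Linear(feat_dim, hid_dim)
        # Stage-2 video style encoder: guidance over the style vectors.
        self.style_encoder = nn.Sequential(
            nn.Linear(feat_dim, hid_dim), nn.ReLU(),
            nn.Linear(hid_dim, num_styles),
        )
        self.word_emb = nn.Embedding(vocab_size, hid_dim)
        self.decoder = nn.GRU(2 * hid_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, video_feats, captions, style_id=None):
        # video_feats: (B, T, feat_dim); captions: (B, L) token ids.
        pooled = video_feats.mean(dim=1)                    # (B, feat_dim)
        if style_id is not None:
            # Stage 1: condition on the known style of each caption.
            style = self.style_vectors[style_id]            # (B, hid_dim)
        else:
            # Stage 2: soft mixture predicted from the video itself.
            w = F.softmax(self.style_encoder(pooled), dim=-1)
            style = w @ self.style_vectors                  # (B, hid_dim)
        h0 = torch.tanh(self.video_proj(pooled)).unsqueeze(0)
        emb = self.word_emb(captions)                       # (B, L, hid_dim)
        style_seq = style.unsqueeze(1).expand(-1, emb.size(1), -1)
        dec_out, _ = self.decoder(torch.cat([emb, style_seq], -1), h0)
        return self.out(dec_out)                            # (B, L, vocab)

# Hypothetical usage: stage 1 passes the caption's known style id;
# stage 2 omits it, so the style encoder infers it from the video.
model = StyleAwareCaptioner()
feats = torch.randn(2, 20, 1024)
caps = torch.randint(0, 10000, (2, 12))
logits_stage1 = model(feats, caps, style_id=torch.tensor([0, 3]))
logits_stage2 = model(feats, caps)
```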

Original language: English
Article number: 112258
Journal: Knowledge-Based Systems
Volume: 301
DOI
Publication status: Published - 9 Oct 2024
