TY - JOUR
T1 - Style-aware two-stage learning framework for video captioning
AU - Ma, Yunchuan
AU - Zhu, Zheng
AU - Qi, Yuankai
AU - Beheshti, Amin
AU - Li, Ying
AU - Qing, Laiyun
AU - Li, Guorong
N1 - Publisher Copyright:
© 2024 The Author(s)
PY - 2024/10/9
Y1 - 2024/10/9
N2 - Significant progress has been made in video captioning in recent years. However, most existing methods learn directly from all given captions without distinguishing their styles. The large diversity among these captions may introduce ambiguity into model learning. To address this issue, we propose a style-aware two-stage learning framework. In the first stage, the model is trained with captions of separate styles, including length style (short, medium, or long), action style (single or multiple actions), and object style (one object or more). For efficiency, a shared model with multiple individual style vectors is learned. In the second stage, a video style encoder is devised to capture style information from the input video and output a guidance signal indicating how to utilize the style vectors for the final caption generation. Without bells and whistles, our method achieves state-of-the-art performance on three widely used public datasets: MSVD, MSR-VTT, and VATEX. The source code and trained models will be made available to the public.
AB - Significant progress has been made in video captioning in recent years. However, most existing methods learn directly from all given captions without distinguishing their styles. The large diversity among these captions may introduce ambiguity into model learning. To address this issue, we propose a style-aware two-stage learning framework. In the first stage, the model is trained with captions of separate styles, including length style (short, medium, or long), action style (single or multiple actions), and object style (one object or more). For efficiency, a shared model with multiple individual style vectors is learned. In the second stage, a video style encoder is devised to capture style information from the input video and output a guidance signal indicating how to utilize the style vectors for the final caption generation. Without bells and whistles, our method achieves state-of-the-art performance on three widely used public datasets: MSVD, MSR-VTT, and VATEX. The source code and trained models will be made available to the public.
KW - Controllable
KW - Style-aware
KW - Two-stage learning
KW - Video captioning
UR - http://www.scopus.com/inward/record.url?scp=85199418091&partnerID=8YFLogxK
U2 - 10.1016/j.knosys.2024.112258
DO - 10.1016/j.knosys.2024.112258
M3 - Article
AN - SCOPUS:85199418091
SN - 0950-7051
VL - 301
JO - Knowledge-Based Systems
JF - Knowledge-Based Systems
M1 - 112258
ER -