TY - GEN
T1 - TAVGBench: Benchmarking Text to Audible-Video Generation
T2 - 32nd ACM International Conference on Multimedia, MM 2024
AU - Mao, Yuxin
AU - Shen, Xuyang
AU - Zhang, Jing
AU - Qin, Zhen
AU - Zhou, Jinxing
AU - Xiang, Mochu
AU - Zhong, Yiran
AU - Dai, Yuchao
N1 - Publisher Copyright:
© 2024 ACM.
PY - 2024/10/28
Y1 - 2024/10/28
AB - The Text to Audible-Video Generation (TAVG) task involves generating videos with accompanying audio based on text descriptions. Achieving this requires skillful alignment of both the audio and video elements. To support research in this field, we have developed a comprehensive Text to Audible-Video Generation Benchmark (TAVGBench), which contains over 1.7 million clips with a total duration of 11.8 thousand hours. We propose an automatic annotation pipeline to ensure that each audible video has detailed descriptions of both its audio and video content. We also introduce the Audio-Visual Harmony score (AVHScore) to provide a quantitative measure of the alignment between the generated audio and video modalities. Additionally, we present a baseline model for TAVG called TAVDiffusion, which uses a two-stream latent diffusion model to provide a fundamental starting point for further research in this area. We achieve the alignment of audio and video by employing cross-attention and contrastive learning. Through extensive experiments and evaluations on TAVGBench, we demonstrate the effectiveness of the proposed model under both conventional metrics and our proposed metric. The dataset and code are available at https://npucvr.github.io/TAVGBench/ and https://github.com/OpenNLPLab/TAVGBench.
KW - text to audible-video diffusion (tavdiffusion)
KW - text to audible-video generation benchmark (tavgbench)
UR - http://www.scopus.com/inward/record.url?scp=85206434503&partnerID=8YFLogxK
U2 - 10.1145/3664647.3680612
DO - 10.1145/3664647.3680612
M3 - Conference contribution
AN - SCOPUS:85206434503
T3 - MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia
SP - 6607
EP - 6616
BT - MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia
PB - Association for Computing Machinery, Inc
Y2 - 28 October 2024 through 1 November 2024
ER -