TAVGBench: Benchmarking Text to Audible-Video Generation

Yuxin Mao; Xuyang Shen; Jing Zhang; Zhen Qin; Jinxing Zhou; Mochu Xiang; Yiran Zhong; Yuchao Dai

doi:10.1145/3664647.3680612


School of Electronics and Information

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

5 Scopus citations

Abstract

The Text to Audible-Video Generation (TAVG) task involves generating videos with accompanying audio from text descriptions. Achieving this requires careful alignment of the audio and video elements. To support research in this field, we have developed a comprehensive Text to Audible-Video Generation Benchmark (TAVGBench), which contains over 1.7 million clips with a total duration of 11.8 thousand hours. We propose an automatic annotation pipeline to ensure that each audible video has detailed descriptions of both its audio and video content. We also introduce the Audio-Visual Harmony score (AVHScore) to provide a quantitative measure of the alignment between the generated audio and video modalities. Additionally, we present a baseline model for TAVG called TAVDiffusion, which uses a two-stream latent diffusion model and provides a starting point for further research in this area. We align audio and video by employing cross-attention and contrastive learning. Through extensive experiments and evaluations on TAVGBench, we demonstrate the effectiveness of our proposed model under both conventional metrics and our proposed metrics. The dataset and code are available on the project page, https://npucvr.github.io/TAVGBench/, and on GitHub, https://github.com/OpenNLPLab/TAVGBench.
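
The abstract does not say how AVHScore is computed. The sketch below is only an illustration of what an audio-video alignment measure could look like. It assumes that each video frame and the audio track have already been embedded into a shared vector space by some pretrained encoder. The encoder, the function names `cosine` and `alignment_score`, and the choice of mean cosine similarity are all hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def alignment_score(
    frame_embeddings: Sequence[Sequence[float]],
    audio_embedding: Sequence[float],
) -> float:
    """Hypothetical alignment measure: the mean cosine similarity between
    each frame embedding and a single audio embedding.

    Assumption: all embeddings come from the same joint audio-visual space.
    """
    if not frame_embeddings:
        raise ValueError("at least one frame embedding is required")
    total = sum(cosine(frame, audio_embedding) for frame in frame_embeddings)
    return total / len(frame_embeddings)


# Toy example with 2-D embeddings: one frame matches the audio direction,
# the other is orthogonal to it.
frames = [[1.0, 0.0], [0.0, 1.0]]
audio = [1.0, 0.0]
score = alignment_score(frames, audio)
```

Under these assumptions a higher score means the frames and the audio point in more similar directions in the shared space. The definition in the paper may differ.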

Original language	English
Title of host publication	MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia
Publisher	Association for Computing Machinery, Inc
Pages	6607-6616
Number of pages	10
ISBN (Electronic)	9798400706868
DOIs	https://doi.org/10.1145/3664647.3680612
State	Published - 28 Oct 2024
Event	32nd ACM International Conference on Multimedia, MM 2024 - Melbourne, Australia Duration: 28 Oct 2024 → 1 Nov 2024

Publication series

Name	MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia

Conference

Conference	32nd ACM International Conference on Multimedia, MM 2024
Country/Territory	Australia
City	Melbourne
Period	28/10/24 → 1/11/24

Keywords

text to audible-video diffusion (tavdiffusion)
text to audible-video generation benchmark (tavgbench)

Access to Document

10.1145/3664647.3680612

Cite this

Mao, Y., Shen, X., Zhang, J., Qin, Z., Zhou, J., Xiang, M., Zhong, Y., & Dai, Y. (2024). TAVGBench: Benchmarking Text to Audible-Video Generation. In MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia (pp. 6607-6616). (MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia). Association for Computing Machinery, Inc. https://doi.org/10.1145/3664647.3680612

@inproceedings{501603826b79438c8b1736db533959f8,

title = "TAVGBench: Benchmarking Text to Audible-Video Generation",

abstract = "The Text to Audible-Video Generation (TAVG) task involves generating videos with accompanying audio from text descriptions. Achieving this requires careful alignment of the audio and video elements. To support research in this field, we have developed a comprehensive Text to Audible-Video Generation Benchmark (TAVGBench), which contains over 1.7 million clips with a total duration of 11.8 thousand hours. We propose an automatic annotation pipeline to ensure that each audible video has detailed descriptions of both its audio and video content. We also introduce the Audio-Visual Harmony score (AVHScore) to provide a quantitative measure of the alignment between the generated audio and video modalities. Additionally, we present a baseline model for TAVG called TAVDiffusion, which uses a two-stream latent diffusion model and provides a starting point for further research in this area. We align audio and video by employing cross-attention and contrastive learning. Through extensive experiments and evaluations on TAVGBench, we demonstrate the effectiveness of our proposed model under both conventional metrics and our proposed metrics. The dataset and code are available on the project page, https://npucvr.github.io/TAVGBench/, and on GitHub, https://github.com/OpenNLPLab/TAVGBench.",

keywords = "text to audible-video diffusion (tavdiffusion), text to audible-video generation benchmark (tavgbench)",

author = "Yuxin Mao and Xuyang Shen and Jing Zhang and Zhen Qin and Jinxing Zhou and Mochu Xiang and Yiran Zhong and Yuchao Dai",

note = "Publisher Copyright: {\textcopyright} 2024 ACM.; 32nd ACM International Conference on Multimedia, MM 2024 ; Conference date: 28-10-2024 Through 01-11-2024",

year = "2024",

month = oct,

day = "28",

doi = "10.1145/3664647.3680612",

language = "English",

series = "MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia",

publisher = "Association for Computing Machinery, Inc",

pages = "6607--6616",

booktitle = "MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia",

}

Mao, Y, Shen, X, Zhang, J, Qin, Z, Zhou, J, Xiang, M, Zhong, Y & Dai, Y 2024, TAVGBench: Benchmarking Text to Audible-Video Generation. in MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia. MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia, Association for Computing Machinery, Inc, pp. 6607-6616, 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, Australia, 28/10/24. https://doi.org/10.1145/3664647.3680612

TAVGBench: Benchmarking Text to Audible-Video Generation. / Mao, Yuxin; Shen, Xuyang; Zhang, Jing et al.
MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia. Association for Computing Machinery, Inc, 2024. p. 6607-6616 (MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia).

TY - GEN

T1 - TAVGBench

T2 - 32nd ACM International Conference on Multimedia, MM 2024

AU - Mao, Yuxin

AU - Shen, Xuyang

AU - Zhang, Jing

AU - Qin, Zhen

AU - Zhou, Jinxing

AU - Xiang, Mochu

AU - Zhong, Yiran

AU - Dai, Yuchao

PY - 2024/10/28

Y1 - 2024/10/28

AB - The Text to Audible-Video Generation (TAVG) task involves generating videos with accompanying audio from text descriptions. Achieving this requires careful alignment of the audio and video elements. To support research in this field, we have developed a comprehensive Text to Audible-Video Generation Benchmark (TAVGBench), which contains over 1.7 million clips with a total duration of 11.8 thousand hours. We propose an automatic annotation pipeline to ensure that each audible video has detailed descriptions of both its audio and video content. We also introduce the Audio-Visual Harmony score (AVHScore) to provide a quantitative measure of the alignment between the generated audio and video modalities. Additionally, we present a baseline model for TAVG called TAVDiffusion, which uses a two-stream latent diffusion model and provides a starting point for further research in this area. We align audio and video by employing cross-attention and contrastive learning. Through extensive experiments and evaluations on TAVGBench, we demonstrate the effectiveness of our proposed model under both conventional metrics and our proposed metrics. The dataset and code are available on the project page, https://npucvr.github.io/TAVGBench/, and on GitHub, https://github.com/OpenNLPLab/TAVGBench.

KW - text to audible-video diffusion (tavdiffusion)

KW - text to audible-video generation benchmark (tavgbench)

UR - http://www.scopus.com/inward/record.url?scp=85206434503&partnerID=8YFLogxK

U2 - 10.1145/3664647.3680612

DO - 10.1145/3664647.3680612

M3 - Conference contribution

AN - SCOPUS:85206434503

T3 - MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia

SP - 6607

EP - 6616

BT - MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia

PB - Association for Computing Machinery, Inc

Y2 - 28 October 2024 through 1 November 2024

ER -
