Fine-grained Audible Video Description

Xuyang Shen; Dong Li; Jinxing Zhou; Zhen Qin; Bowen He; Xiaodong Han; Aixuan Li; Yuchao Dai; Ling Peng Kong; Meng Wang; Yu Qiao; Yiran Zhong

doi:10.1109/CVPR52729.2023.01020

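School of Electronics and Information

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

20 citations (Scopus)

Abstract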

We explore a new task for audio-visual-language modeling called fine-grained audible video description (FAVD). It aims to provide detailed textual descriptions of audible videos, including the appearance and spatial location of each object, the actions of moving objects, and the sounds in the videos. Existing visual-language modeling tasks often concentrate on visual cues in videos while undervaluing the language and audio modalities. In contrast, FAVD requires not only audio-visual-language modeling skills but also paragraph-level language generation abilities. We construct the first fine-grained audible video description benchmark (FAVDBench) to facilitate this research. For each video clip, we first provide a one-sentence summary of the video, i.e., the caption, followed by 4-6 sentences describing the visual details and 1-2 audio-related descriptions at the end. The descriptions are provided in both English and Chinese. We create two new metrics for this task: an EntityScore to gauge the completeness of entities in the visual descriptions, and an AudioScore to assess the audio descriptions. As a preliminary approach to this task, we propose an audio-visual-language transformer that extends an existing video captioning model with an additional audio branch. We combine the masked language modeling and auto-regressive language modeling losses to optimize our model so that it can produce paragraph-level descriptions. We illustrate the effectiveness of our model in audio-visual-language modeling by evaluating it against the proposed benchmark using both conventional captioning metrics and our proposed metrics. We further put our benchmark to the test in video generation models, demonstrating that employing fine-grained video descriptions can create more intricate videos than using captions. Code and dataset are available at https://github.com/OpenNLPLab/FAVDBench. Our online benchmark is available at www.avlbench.opennlplab.cn.
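
The annotation layout described in the abstract (one caption, then 4-6 visual sentences, then 1-2 audio descriptions) can be pictured with a small data structure. The following is only a minimal sketch under stated assumptions: the class name `FAVDAnnotation`, its field names, and the example sentences are hypothetical and are not taken from the FAVDBench code or dataset, whose actual file format may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FAVDAnnotation:
    """Hypothetical container for one clip's description (names are illustrative)."""
    caption: str                    # one-sentence summary of the video
    visual_details: List[str]       # 4-6 sentences on appearance, location, actions
    audio_descriptions: List[str]   # 1-2 sentences on the sounds in the clip

    def validate(self) -> None:
        """Raise ValueError if the sentence counts fall outside the stated ranges."""
        if not self.caption.strip():
            raise ValueError("caption must be a non-empty sentence")
        if not 4 <= len(self.visual_details) <= 6:
            raise ValueError("expected 4-6 visual detail sentences")
        if not 1 <= len(self.audio_descriptions) <= 2:
            raise ValueError("expected 1-2 audio descriptions")

    def to_paragraph(self) -> str:
        """Join the parts in the order given in the abstract: caption, visual, audio."""
        parts = [self.caption, *self.visual_details, *self.audio_descriptions]
        return " ".join(parts)


# Invented example sentences, used only to exercise the structure.
example = FAVDAnnotation(
    caption="A man plays a guitar in a room.",
    visual_details=[
        "The man wears a black shirt.",
        "He sits on a wooden chair on the left.",
        "A brown guitar rests on his lap.",
        "His right hand strums the strings.",
    ],
    audio_descriptions=["The guitar produces a soft melody."],
)
example.validate()
paragraph = example.to_paragraph()
```

Keeping the three parts separate, rather than storing one paragraph string, makes it straightforward to score the visual and audio sentences independently, which is the split that the EntityScore and AudioScore metrics suggest.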

Original language	English
Title of host publication	Proceedings - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023
Publisher	IEEE Computer Society
Pages	10585-10596
Number of pages	12
ISBN (Electronic)	9798350301298
DOI	https://doi.org/10.1109/CVPR52729.2023.01020
Publication status	Published - 2023
Event	2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023 - Vancouver, Canada. Duration: 18 Jun 2023 → 22 Jun 2023

Publication series

Name	Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Volume	2023-June
ISSN (Print)	1063-6919

Conference

Conference	2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023
Country/Territory	Canada
City	Vancouver
Period	18/06/23 → 22/06/23


Cite this

Shen, X., Li, D., Zhou, J., Qin, Z., He, B., Han, X., Li, A., Dai, Y., Kong, L. P., Wang, M., Qiao, Y., & Zhong, Y. (2023). Fine-grained Audible Video Description. In Proceedings - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023 (pp. 10585-10596). (Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition; Vol. 2023-June). IEEE Computer Society. https://doi.org/10.1109/CVPR52729.2023.01020

@inproceedings{aee3b92268ea48d1bf22fb97f5a0bf77,

title = "Fine-grained Audible Video Description",

keywords = "Vision, language, and reasoning",

author = "Xuyang Shen and Dong Li and Jinxing Zhou and Zhen Qin and Bowen He and Xiaodong Han and Aixuan Li and Yuchao Dai and Kong, {Ling Peng} and Meng Wang and Yu Qiao and Yiran Zhong",

note = "Publisher Copyright: {\textcopyright} 2023 IEEE.; 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023 ; Conference date: 18-06-2023 Through 22-06-2023",

year = "2023",

doi = "10.1109/CVPR52729.2023.01020",

language = "English",

series = "Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition",

publisher = "IEEE Computer Society",

pages = "10585--10596",

booktitle = "Proceedings - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023",

}

Shen, X, Li, D, Zhou, J, Qin, Z, He, B, Han, X, Li, A, Dai, Y, Kong, LP, Wang, M, Qiao, Y & Zhong, Y 2023, Fine-grained Audible Video Description. in Proceedings - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2023-June, IEEE Computer Society, pp. 10585-10596, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, Canada, 18/06/23. https://doi.org/10.1109/CVPR52729.2023.01020

Fine-grained Audible Video Description. / Shen, Xuyang; Li, Dong; Zhou, Jinxing et al.
Proceedings - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023. IEEE Computer Society, 2023. p. 10585-10596 (Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition; Vol. 2023-June).

TY - GEN

T1 - Fine-grained Audible Video Description

AU - Shen, Xuyang

AU - Li, Dong

AU - Zhou, Jinxing

AU - Qin, Zhen

AU - He, Bowen

AU - Han, Xiaodong

AU - Li, Aixuan

AU - Dai, Yuchao

AU - Kong, Ling Peng

AU - Wang, Meng

AU - Qiao, Yu

AU - Zhong, Yiran

PY - 2023

Y1 - 2023

KW - Vision, language, and reasoning

UR - http://www.scopus.com/inward/record.url?scp=85173944154&partnerID=8YFLogxK

U2 - 10.1109/CVPR52729.2023.01020

DO - 10.1109/CVPR52729.2023.01020

M3 - Conference contribution

AN - SCOPUS:85173944154

T3 - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition

SP - 10585

EP - 10596

BT - Proceedings - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023

PB - IEEE Computer Society

T2 - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023

Y2 - 18 June 2023 through 22 June 2023

ER -
