Abstract
Owing to well-designed large-scale video-text datasets, recent years have witnessed tremendous progress in video-text pre-training. However, existing large-scale video-text datasets are mostly English-only. Although some methods do study Chinese video-text pre-training, they pre-train their models on private datasets whose videos and text are unavailable. This lack of large-scale public datasets and benchmarks in Chinese hampers research on, and downstream applications of, Chinese video-text pre-training. To this end, we release and benchmark CNVid-3.5M, a large-scale public cross-modal dataset containing over 3.5M Chinese video-text pairs. We summarize our contributions with three verbs, i.e., "Build", "Filter", and "Pretrain": 1) to build a public Chinese video-text dataset, we collect over 4.5M videos from Chinese websites; 2) to improve data quality, we propose a novel method that filters out 1M weakly-paired videos, yielding the CNVid-3.5M dataset; and 3) we benchmark CNVid-3.5M with three mainstream pixel-level pre-training architectures. Finally, we propose a Hard Sample Curriculum Learning strategy to further improve pre-training performance. To the best of our knowledge, CNVid-3.5M is the largest public video-text dataset in Chinese, and we provide the first pixel-level benchmarks for Chinese video-text pre-training.
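The abstract names the filtering step without describing its mechanism. As a purely illustrative sketch (the paper's actual procedure is not given here), one common way to drop weakly-paired video-text data is to score each pair with a pretrained cross-modal encoder and discard low-similarity pairs; every identifier below (`score_pair`, `filter_pairs`, `SIM_THRESHOLD`, the encoder interface) is a hypothetical assumption, not the authors' method.

```python
# Hypothetical sketch: score (video, caption) pairs with embeddings from a
# pretrained cross-modal encoder and drop weak pairs. This is NOT the paper's
# actual filtering method (the abstract does not detail it); it only
# illustrates the general similarity-threshold idea.
import torch
import torch.nn.functional as F

SIM_THRESHOLD = 0.25  # assumed cutoff; in practice tuned on held-out data


def score_pair(frame_embs: torch.Tensor, text_emb: torch.Tensor) -> float:
    """Mean cosine similarity between sampled frame embeddings of shape
    (num_frames, dim) and a single caption embedding of shape (dim,)."""
    frame_embs = F.normalize(frame_embs, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    return (frame_embs @ text_emb).mean().item()


def filter_pairs(pairs):
    """Keep only pairs whose similarity clears the threshold.

    `pairs` yields (frame_embs, text_emb, meta) tuples produced upstream by
    any pretrained vision-language encoder (an assumption of this sketch).
    """
    return [meta for frame_embs, text_emb, meta in pairs
            if score_pair(frame_embs, text_emb) >= SIM_THRESHOLD]
```

A single-pass threshold like this trades dataset size for pairing quality, which would be consistent with the abstract's report of pruning roughly 1M of the 4.5M collected videos; the authors' actual criterion may differ.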
| Field | Value |
|---|---|
| Original language | English |
| Pages (from-to) | 14815-14824 |
| Number of pages | 10 |
| Journal | Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition |
| Volume | 2023-June |
| DOIs | |
| State | Published - 2023 |
| Externally published | Yes |
| Event | 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, Canada |
| Event duration | 18 Jun 2023 → 22 Jun 2023 |
Keywords
- Multi-modal learning