TY - JOUR
T1 - SNP-S3
T2 - Shared Network Pre-Training and Significant Semantic Strengthening for Various Video-Text Tasks
AU - Dong, Xingning
AU - Guo, Qingpei
AU - Gan, Tian
AU - Wang, Qing
AU - Wu, Jianlong
AU - Ren, Xiangyuan
AU - Cheng, Yuan
AU - Chu, Wei
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2024/4/1
Y1 - 2024/4/1
AB - We present a framework for learning cross-modal video representations by pre-training directly on raw data to facilitate various downstream video-text tasks. Our main contributions lie in the pre-training framework and the proxy tasks. First, motivated by the shortcomings of the two mainstream pixel-level pre-training architectures (limited applicability or low efficiency), we propose Shared Network Pre-training (SNP). By employing a single shared BERT-type network to refine textual and cross-modal features simultaneously, SNP is lightweight and can support various downstream applications. Second, based on the intuition that people attend to a few “significant words” when understanding a sentence, we propose the Significant Semantic Strengthening (S3) strategy, which introduces novel masking and matching proxy tasks to improve pre-training performance. Experiments on three downstream video-text tasks and six datasets demonstrate that we establish a new state of the art in pixel-level video-text pre-training and achieve a satisfactory balance between pre-training efficiency and fine-tuning performance. The codebase and pre-trained models are available at https://github.com/dongxingning/SNPS3.
KW - masked language modeling
KW - video-text matching
KW - Video-text pre-training
KW - vision and language
UR - http://www.scopus.com/inward/record.url?scp=85167805295&partnerID=8YFLogxK
U2 - 10.1109/TCSVT.2023.3303945
DO - 10.1109/TCSVT.2023.3303945
M3 - Article
AN - SCOPUS:85167805295
SN - 1051-8215
VL - 34
SP - 2525
EP - 2535
JO - IEEE Transactions on Circuits and Systems for Video Technology
JF - IEEE Transactions on Circuits and Systems for Video Technology
IS - 4
ER -