TY - JOUR
T1 - SNP-S3
T2 - Shared Network Pre-Training and Significant Semantic Strengthening for Various Video-Text Tasks
AU - Dong, Xingning
AU - Guo, Qingpei
AU - Gan, Tian
AU - Wang, Qing
AU - Wu, Jianlong
AU - Ren, Xiangyuan
AU - Cheng, Yuan
AU - Chu, Wei
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2024/4/1
Y1 - 2024/4/1
AB - We present a framework for learning cross-modal video representations by pre-training directly on raw data to facilitate various downstream video-text tasks. Our main contributions lie in the pre-training framework and the proxy tasks. First, motivated by the shortcomings of the two mainstream pixel-level pre-training architectures (limited applicability or low efficiency), we propose Shared Network Pre-training (SNP). By employing a single shared BERT-type network to refine textual and cross-modal features simultaneously, SNP is lightweight and can support various downstream applications. Second, based on the intuition that people attend to a few “significant words” when understanding a sentence, we propose the Significant Semantic Strengthening (S3) strategy, which introduces novel masking and matching proxy tasks to improve pre-training performance. Experiments on three downstream video-text tasks and six datasets demonstrate that we establish a new state of the art in pixel-level video-text pre-training and achieve a satisfactory balance between pre-training efficiency and fine-tuning performance. The codebase and pre-trained models are available at https://github.com/dongxingning/SNPS3.
KW - masked language modeling
KW - video-text matching
KW - Video-text pre-training
KW - vision and language
UR - http://www.scopus.com/inward/record.url?scp=85167805295&partnerID=8YFLogxK
U2 - 10.1109/TCSVT.2023.3303945
DO - 10.1109/TCSVT.2023.3303945
M3 - Article
AN - SCOPUS:85167805295
SN - 1051-8215
VL - 34
SP - 2525
EP - 2535
JO - IEEE Transactions on Circuits and Systems for Video Technology
JF - IEEE Transactions on Circuits and Systems for Video Technology
IS - 4
ER -