SNP-S3: Shared Network Pre-Training and Significant Semantic Strengthening for Various Video-Text Tasks

Xingning Dong, Qingpei Guo, Tian Gan, Qing Wang, Jianlong Wu, Xiangyuan Ren, Yuan Cheng, Wei Chu

Research output: Contribution to journal › Article › peer-review


Abstract

We present a framework for learning cross-modal video representations by pre-training directly on raw data to facilitate various downstream video-text tasks. Our main contributions lie in the pre-training framework and the proxy tasks. First, to address the shortcomings of the two mainstream pixel-level pre-training architectures (limited applicability or low efficiency), we propose Shared Network Pre-training (SNP). By employing one shared BERT-type network to refine textual and cross-modal features simultaneously, SNP is lightweight and can support various downstream applications. Second, based on the intuition that people pay particular attention to a few "significant words" when understanding a sentence, we propose the Significant Semantic Strengthening (S3) strategy, which includes novel masking and matching proxy tasks to improve pre-training performance. Experiments on three downstream video-text tasks and six datasets demonstrate that we establish a new state of the art in pixel-level video-text pre-training, while achieving a satisfactory balance between pre-training efficiency and fine-tuning performance. The codebase and pre-trained models are available at https://github.com/dongxingning/SNPS3.
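The abstract's intuition behind S3 is that masking should favor "significant words" rather than sampling tokens uniformly, as in standard masked language modeling. The following is a minimal sketch of that idea, not the authors' implementation: it assumes each token already carries a significance score (how such scores are obtained is specified in the paper, not here), and simply masks the top-scoring tokens.

```python
def significance_masking(tokens, scores, mask_ratio=0.15, mask_token="[MASK]"):
    """Mask the most 'significant' tokens instead of uniformly random ones.

    tokens: list of word-piece strings.
    scores: per-token significance scores (hypothetical; e.g. saliency weights).
    Returns the masked token list and the sorted indices that were masked.
    """
    k = max(1, int(len(tokens) * mask_ratio))
    # Pick the k indices with the highest significance scores.
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    masked = [mask_token if i in top else t for i, t in enumerate(tokens)]
    return masked, sorted(top)

# Toy example: content words ("man", "playing", "guitar") score highest,
# so they are masked first; function words are left intact.
tokens = ["a", "man", "is", "playing", "guitar", "on", "stage"]
scores = [0.1, 0.9, 0.1, 0.8, 0.95, 0.1, 0.6]  # illustrative scores only
masked, idx = significance_masking(tokens, scores, mask_ratio=0.3)
```

In this toy run, `masked` becomes `["a", "[MASK]", "is", "playing", "[MASK]", "on", "stage"]`, so the model must reconstruct the semantically heavy words, which is the strengthening effect the abstract describes.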

Original language: English
Pages (from-to): 2525-2535
Number of pages: 11
Journal: IEEE Transactions on Circuits and Systems for Video Technology
Volume: 34
Issue number: 4
DOIs
State: Published - 1 Apr 2024
Externally published: Yes

Keywords

  • masked language modeling
  • video-text matching
  • video-text pre-training
  • vision and language
