Self-Supervised Learning for Heterogeneous Audiovisual Scene Analysis

Di Hu, Zheng Wang, Feiping Nie, Rong Wang, Xuelong Li

Research output: Contribution to journal › Article › peer-review

2 Scopus citations

Abstract

Due to the difficulty of annotating large amounts of training data, directly learning the association of sound and its makers in natural videos is a challenging task for machines. In this paper, we present a novel audiovisual model that introduces a soft-clustering module as the audio and visual content detector, and regards the pervasive property of audiovisual concurrency as the latent supervision for inferring the correlation among detected contents. Furthermore, we discover for the first time that the complexity of data has an impact on the training efficiency and subsequent performance of the audiovisual model, i.e., more complex data brings more obstacles to model training and degrades the performance of downstream audiovisual tasks. To address this issue in audiovisual learning, we propose a novel heterogeneous audiovisual scene analysis module that trains the model from simple to complex scenes. We show that such an ordered learning procedure rewards the model with the merits of easy training and fast convergence. Meanwhile, our audiovisual model can also provide effective unimodal representation and cross-modal alignment performance. We further deploy the well-trained model in practical audiovisual sound localization and separation tasks. We show that our localization model significantly outperforms existing methods, based on which we show comparable performance in the sound separation task compared with several related SOTA audiovisual learning methods, without referring to external visual supervision.

Original language: English
Pages (from-to): 3534-3545
Number of pages: 12
Journal: IEEE Transactions on Multimedia
Volume: 25
State: Published - 2023

Keywords

  • Audiovisual sound localization and separation
  • heterogeneous audiovisual scene analysis
  • multimodal audiovisual learning
