Self-Supervised Learning for Heterogeneous Audiovisual Scene Analysis

Di Hu, Zheng Wang, Feiping Nie, Rong Wang, Xuelong Li

Research output: Contribution to journal › Article › peer-review

2 Citations (Scopus)

Abstract

Due to the difficulty of annotating large amounts of training data, directly learning the association between sounds and their makers in natural videos is a challenging task for machines. In this paper, we present a novel audiovisual model that introduces a soft-clustering module as the audio and visual content detector, and regards the pervasive property of audiovisual concurrency as latent supervision for inferring the correlation among detected contents. Furthermore, we discover for the first time that the complexity of the data affects the training efficiency and subsequent performance of the audiovisual model, i.e., more complex data poses more obstacles to model training and degrades performance on downstream audiovisual tasks. To address this issue, we propose a novel heterogeneous audiovisual scene analysis module that trains the model from simple to complex scenes. We show that such an ordered learning procedure rewards the model with easy training and fast convergence. Meanwhile, our audiovisual model also provides effective unimodal representations and strong cross-modal alignment. We further deploy the well-trained model in practical audiovisual sound localization and separation tasks. Our localization model significantly outperforms existing methods, and, building on it, our separation model achieves performance comparable to several related state-of-the-art audiovisual learning methods on the sound separation task without referring to external visual supervision.
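Since this record carries only the abstract, the following is a minimal, hypothetical PyTorch sketch of the three ideas the abstract names: a soft-clustering content detector, audiovisual concurrency used as latent supervision, and simple-to-complex curriculum ordering. Every class, function, dimension, and loss form below is an assumption chosen for illustration, not the authors' released implementation.

```python
# Illustrative sketch only; all names and shapes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftClusterDetector(nn.Module):
    """Assigns each feature vector a soft distribution over K content clusters."""
    def __init__(self, feat_dim: int, num_clusters: int):
        super().__init__()
        # Learnable cluster centers act as content prototypes.
        self.centers = nn.Parameter(torch.randn(num_clusters, feat_dim))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, feat_dim). Cosine similarity to each center,
        # softmax over clusters -> soft assignments (batch, num_clusters).
        sim = F.normalize(feats, dim=-1) @ F.normalize(self.centers, dim=-1).t()
        return sim.softmax(dim=-1)

def concurrency_loss(audio_assign: torch.Tensor, visual_assign: torch.Tensor) -> torch.Tensor:
    """Audiovisual concurrency as latent supervision: the soft cluster
    assignments of the two modalities of the same clip should agree,
    while mismatched clips within the batch should not."""
    logits = audio_assign @ visual_assign.t()  # (batch, batch) agreement matrix
    targets = torch.arange(logits.size(0), device=logits.device)  # diagonal = concurrent pairs
    return F.cross_entropy(logits, targets)

def curriculum_batches(dataset, complexity_fn, stages=(1, 2, 4)):
    """Simple-to-complex ordering: yield clips whose assumed complexity score
    (e.g., number of sounding sources) stays under a growing threshold."""
    for max_sources in stages:
        for clip in dataset:
            if complexity_fn(clip) <= max_sources:
                yield clip
```

The diagonal cross-entropy is one common way to operationalize "co-occurring audio and visual streams should match"; the paper's actual correlation-inference objective may differ.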

Original language: English
Pages (from-to): 3534-3545
Number of pages: 12
Journal: IEEE Transactions on Multimedia
Volume: 25
DOI
Publication status: Published - 2023
