TY - JOUR
T1 - Self-Supervised Learning for Heterogeneous Audiovisual Scene Analysis
AU - Hu, Di
AU - Wang, Zheng
AU - Nie, Feiping
AU - Wang, Rong
AU - Li, Xuelong
N1 - Publisher Copyright:
© 1999-2012 IEEE.
PY - 2023
Y1 - 2023
N2 - Due to the difficulty of annotating large amounts of training data, directly learning the association between sounds and their makers in natural videos is a challenging task for machines. In this paper, we present a novel audiovisual model that introduces a soft-clustering module as the audio and visual content detector, and regards the pervasive property of audiovisual concurrency as latent supervision for inferring the correlation among the detected contents. Furthermore, we discover for the first time that data complexity has an impact on the training efficiency and subsequent performance of the audiovisual model, i.e., more complex data introduces more obstacles to model training and degrades performance on downstream audiovisual tasks. To address this issue, we propose a novel heterogeneous audiovisual scene analysis module that trains the model from simple to complex scenes. We show that such an ordered learning procedure rewards the model with the merits of easy training and fast convergence. Meanwhile, our audiovisual model also provides effective unimodal representations and strong cross-modal alignment performance. We further deploy the well-trained model in practical audiovisual sound localization and separation tasks. We show that our localization model significantly outperforms existing methods, and, building on it, achieves comparable performance in the sound separation task to several related state-of-the-art audiovisual learning methods, without referring to external visual supervision.
AB - Due to the difficulty of annotating large amounts of training data, directly learning the association between sounds and their makers in natural videos is a challenging task for machines. In this paper, we present a novel audiovisual model that introduces a soft-clustering module as the audio and visual content detector, and regards the pervasive property of audiovisual concurrency as latent supervision for inferring the correlation among the detected contents. Furthermore, we discover for the first time that data complexity has an impact on the training efficiency and subsequent performance of the audiovisual model, i.e., more complex data introduces more obstacles to model training and degrades performance on downstream audiovisual tasks. To address this issue, we propose a novel heterogeneous audiovisual scene analysis module that trains the model from simple to complex scenes. We show that such an ordered learning procedure rewards the model with the merits of easy training and fast convergence. Meanwhile, our audiovisual model also provides effective unimodal representations and strong cross-modal alignment performance. We further deploy the well-trained model in practical audiovisual sound localization and separation tasks. We show that our localization model significantly outperforms existing methods, and, building on it, achieves comparable performance in the sound separation task to several related state-of-the-art audiovisual learning methods, without referring to external visual supervision.
KW - Audiovisual sound localization and separation
KW - heterogeneous audiovisual scene analysis
KW - multimodal audiovisual learning
UR - http://www.scopus.com/inward/record.url?scp=85172681155&partnerID=8YFLogxK
U2 - 10.1109/TMM.2022.3162477
DO - 10.1109/TMM.2022.3162477
M3 - Article
AN - SCOPUS:85172681155
SN - 1520-9210
VL - 25
SP - 3534
EP - 3545
JO - IEEE Transactions on Multimedia
JF - IEEE Transactions on Multimedia
ER -