Self-Supervised Learning for Heterogeneous Audiovisual Scene Analysis

Di Hu, Zheng Wang, Feiping Nie, Rong Wang, Xuelong Li

Research output: Contribution to journal › Article › peer-review

2 Citations (Scopus)

Abstract

Due to the difficulty of annotating large amounts of training data, directly learning the association between sounds and their makers in natural videos is a challenging task for machines. In this paper, we present a novel audiovisual model that introduces a soft-clustering module as the audio and visual content detector, and regards the pervasive property of audiovisual concurrency as latent supervision for inferring the correlation among detected contents. Furthermore, we discover for the first time that the complexity of the data affects the training efficiency and subsequent performance of the audiovisual model, i.e., more complex data poses more obstacles to model training and degrades performance on downstream audiovisual tasks. To address this issue, we propose a novel heterogeneous audiovisual scene analysis module that trains the model from simple to complex scenes. We show that such an ordered learning procedure rewards the model with easy training and fast convergence. Meanwhile, our audiovisual model also provides effective unimodal representations and strong cross-modal alignment. We further deploy the well-trained model in practical audiovisual sound localization and separation tasks. Our localization model significantly outperforms existing methods, and, building on it, our separation model achieves performance comparable to several related state-of-the-art audiovisual learning methods on the sound separation task without referring to external visual supervision.
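Since this record carries only the abstract, the following is a minimal, hypothetical PyTorch sketch of the three ideas the abstract names: a soft-clustering content detector, audiovisual concurrency used as latent supervision, and simple-to-complex curriculum ordering. Every class, function, dimension, and loss form below is an assumption chosen for illustration, not the authors' released implementation.

```python
# Illustrative sketch only; all names and shapes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftClusterDetector(nn.Module):
    """Assigns each feature vector a soft distribution over K content clusters."""
    def __init__(self, feat_dim: int, num_clusters: int):
        super().__init__()
        # Learnable cluster centers act as content prototypes.
        self.centers = nn.Parameter(torch.randn(num_clusters, feat_dim))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, feat_dim). Cosine similarity to each center,
        # softmax over clusters -> soft assignments (batch, num_clusters).
        sim = F.normalize(feats, dim=-1) @ F.normalize(self.centers, dim=-1).t()
        return sim.softmax(dim=-1)

def concurrency_loss(audio_assign: torch.Tensor, visual_assign: torch.Tensor) -> torch.Tensor:
    """Audiovisual concurrency as latent supervision: the soft cluster
    assignments of the two modalities of the same clip should agree,
    while mismatched clips within the batch should not."""
    logits = audio_assign @ visual_assign.t()  # (batch, batch) agreement matrix
    targets = torch.arange(logits.size(0), device=logits.device)  # diagonal = concurrent pairs
    return F.cross_entropy(logits, targets)

def curriculum_batches(dataset, complexity_fn, stages=(1, 2, 4)):
    """Simple-to-complex ordering: yield clips whose assumed complexity score
    (e.g., number of sounding sources) stays under a growing threshold."""
    for max_sources in stages:
        for clip in dataset:
            if complexity_fn(clip) <= max_sources:
                yield clip
```

The diagonal cross-entropy is one common way to operationalize "co-occurring audio and visual streams should match"; the paper's actual correlation-inference objective may differ.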

Original language: English
Pages (from-to): 3534-3545
Number of pages: 12
Journal: IEEE Transactions on Multimedia
Volume: 25
DOI
Publication status: Published - 2023
