Bio-Inspired Audiovisual Multi-Representation Integration via Self-Supervised Learning

Zhaojian Li, Bin Zhao, Yuan Yuan

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

4 Scopus citations

Abstract

Audiovisual self-supervised representation learning has made significant strides on a variety of audiovisual tasks. However, existing methods mostly model a single type of correspondence between the audio and visual modalities, ignoring the complex relationships between them, and therefore cannot perform cross-modal understanding in more natural audiovisual scenes. Several biological studies have shown that human learning is shaped by multi-layered perceptual synchronization. Inspired by this, we propose to exploit the naturally occurring relationships between the audio and visual modalities to learn audiovisual representations under multilayer perceptual integration. First, we introduce an audiovisual multi-representation pretext task that integrates semantic consistency, temporal alignment, and spatial correspondence. Second, we propose a self-supervised audiovisual multi-representation learning approach that simultaneously learns the perceptual relationships between the visual and audio modalities at the semantic, temporal, and spatial levels. To establish fine-grained correspondence between visual objects and sounds, we propose an audiovisual object detection module that detects potential sounding objects by combining unsupervised knowledge at multiple levels. In addition, we propose a modality-wise loss and a task-wise loss that learn a subspace-orthogonal representation space, making representation relations more discriminative. Finally, experimental results demonstrate that jointly understanding the semantic, temporal, and spatial correspondence between the audiovisual modalities enables the model to perform better on downstream tasks such as sound separation, sound spatialization, and audiovisual segmentation.
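The abstract does not specify the form of the losses, so the following is a minimal PyTorch sketch of how a multi-level (semantic, temporal, spatial) contrastive objective with a subspace-orthogonality penalty could look. All names (info_nce, subspace_orthogonality, multi_representation_loss), the InfoNCE-style contrastive terms, and the Frobenius-norm orthogonality penalty are illustrative assumptions, not the authors' implementation.

    # Illustrative sketch only -- not the paper's code. Assumes InfoNCE-style
    # contrastive terms at the semantic, temporal, and spatial levels.
    import torch
    import torch.nn.functional as F

    def info_nce(a, v, temperature=0.07):
        # Symmetric InfoNCE over a batch of paired audio/visual embeddings.
        a = F.normalize(a, dim=-1)
        v = F.normalize(v, dim=-1)
        logits = a @ v.t() / temperature              # (B, B) similarities
        targets = torch.arange(a.size(0), device=a.device)
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))

    def subspace_orthogonality(z1, z2):
        # Hypothetical penalty: mean squared cross-correlation between two
        # (row-normalized) embedding sets, pushing their subspaces apart.
        z1 = F.normalize(z1, dim=-1)
        z2 = F.normalize(z2, dim=-1)
        return (z1.t() @ z2).pow(2).mean()

    def multi_representation_loss(sem, tmp, spa, weights=(1.0, 1.0, 1.0)):
        # sem, tmp, spa: (audio, visual) embedding pairs for the three levels.
        loss = sum(w * info_nce(a, v)
                   for w, (a, v) in zip(weights, (sem, tmp, spa)))
        # Encourage the level-specific audio subspaces to stay orthogonal.
        loss = loss + subspace_orthogonality(sem[0], tmp[0]) \
                    + subspace_orthogonality(tmp[0], spa[0])
        return loss

For example, multi_representation_loss((sem_a, sem_v), (tmp_a, tmp_v), (spa_a, spa_v)) would combine the three pretext terms into a single training objective under these assumptions.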

Original language: English
Title of host publication: MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia
Publisher: Association for Computing Machinery, Inc
Pages: 3755-3764
Number of pages: 10
ISBN (Electronic): 9798400701085
DOIs
State: Published - 26 Oct 2023
Event: 31st ACM International Conference on Multimedia, MM 2023 - Ottawa, Canada
Duration: 29 Oct 2023 - 3 Nov 2023

Publication series

Name: MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia

Conference

Conference: 31st ACM International Conference on Multimedia, MM 2023
Country/Territory: Canada
City: Ottawa
Period: 29/10/23 - 3/11/23

Keywords

  • audiovisual learning
  • contrastive learning
  • representation learning
  • self-supervised learning
