TY - GEN
T1 - Bio-Inspired Audiovisual Multi-Representation Integration via Self-Supervised Learning
AU - Li, Zhaojian
AU - Zhao, Bin
AU - Yuan, Yuan
N1 - Publisher Copyright:
© 2023 ACM.
PY - 2023/10/26
Y1 - 2023/10/26
AB - Audiovisual self-supervised representation learning has made significant strides in various audiovisual tasks. However, existing methods mostly model a single type of correspondence between the audio and visual modalities, ignoring the complex relationships between them and thus failing to support cross-modal understanding in more natural audiovisual scenes. Several biological studies have shown that human learning is shaped by multi-layered perceptual synchronization. Inspired by this, we propose to exploit the relationships that naturally exist between the audio and visual modalities to learn audiovisual representations under multi-layer perceptual integration. First, we introduce an audiovisual multi-representation pretext task that integrates semantic consistency, temporal alignment, and spatial correspondence. Second, we propose a self-supervised audiovisual multi-representation learning approach that simultaneously learns the perceptual relationships between the visual and audio modalities at the semantic, temporal, and spatial levels. To establish fine-grained correspondence between visual objects and sounds, we propose an audiovisual object detection module that detects potential sounding objects by combining unsupervised knowledge at multiple levels. In addition, we propose a modality-wise loss and a task-wise loss to learn a subspace-orthogonal representation space in which representation relations are more discriminative. Finally, experimental results demonstrate that jointly understanding the semantic, temporal, and spatial correspondence between the audiovisual modalities enables the model to perform better on downstream tasks such as sound separation, sound spatialization, and audiovisual segmentation.
KW - audiovisual learning
KW - contrastive learning
KW - representation learning
KW - self-supervised learning
UR - http://www.scopus.com/inward/record.url?scp=85179552361&partnerID=8YFLogxK
U2 - 10.1145/3581783.3612428
DO - 10.1145/3581783.3612428
M3 - Conference contribution
AN - SCOPUS:85179552361
T3 - MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia
SP - 3755
EP - 3764
BT - MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia
PB - Association for Computing Machinery, Inc
T2 - 31st ACM International Conference on Multimedia, MM 2023
Y2 - 29 October 2023 through 3 November 2023
ER -