TY - GEN
T1 - Induction Network: Audio-Visual Modality Gap-Bridging for Self-Supervised Sound Source Localization
T2 - 31st ACM International Conference on Multimedia, MM 2023
AU - Liu, Tianyu
AU - Zhang, Peng
AU - Huang, Wei
AU - Zha, Yufei
AU - You, Tao
AU - Zhang, Yanning
N1 - Publisher Copyright:
© 2023 ACM.
PY - 2023/10/26
Y1 - 2023/10/26
AB - Self-supervised sound source localization is usually challenged by modality inconsistency. In recent studies, contrastive-learning-based strategies have shown promise in establishing a consistent correspondence between audio and sound sources in visual scenes. Unfortunately, insufficient attention to the influence of heterogeneity across modality features still limits further improvement of this scheme, which motivates our work. In this study, an Induction Network is proposed to bridge the modality gap more effectively. By decoupling the gradients of the visual and audio modalities, discriminative visual representations of sound sources can be learned with the designed Induction Vector in a bootstrap manner, which also enables the audio modality to be aligned consistently with the visual modality. In addition to a visual weighted contrastive loss, an adaptive threshold selection strategy is introduced to enhance the robustness of the Induction Network. Extensive experiments conducted on the SoundNet-Flickr and VGG-Sound Source datasets demonstrate superior performance compared to other state-of-the-art works in different challenging scenarios. The code is available at https://github.com/Tahy1/AVIN.
KW - audio-visual
KW - contrastive learning
KW - modality gap
KW - sound source localization
UR - http://www.scopus.com/inward/record.url?scp=85179547361&partnerID=8YFLogxK
U2 - 10.1145/3581783.3612502
DO - 10.1145/3581783.3612502
M3 - Conference contribution
AN - SCOPUS:85179547361
T3 - MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia
SP - 4042
EP - 4052
BT - MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia
PB - Association for Computing Machinery, Inc
Y2 - 29 October 2023 through 3 November 2023
ER -