How does Layer Normalization improve Batch Normalization in self-supervised sound source localization?

Tianyu Liu; Peng Zhang; Wei Huang; Yufei Zha; Tao You; Yanning Zhang

doi:10.1016/j.neucom.2023.127040

School of Computer Science

Research output: Contribution to journal › Article › peer-review

5 Citations (Scopus)

Abstract

Self-supervised sound source localization is often challenged by unexpectedly large inputs and an incorrect direction of normalization in current solutions. A promising way to address this challenge is to avoid feature deformation by incorporating more effective normalization, which is the motivation of this study. Based on a mathematical derivation of the scale independence of Layer Normalization (LN), this work proposes a correspondence consolidation method to reinforce the audio–visual correspondence. By combining input feature normalization with an LN-based SimSiam predictor, joint gradient stabilization can be further achieved for more accurate sound source localization. Extensive experiments on the SoundNet-Flickr and VGG-Sound Source datasets verify superior performance compared with other state-of-the-art works.
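The scale independence of LN mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `layer_norm`, the example feature values, and the scale factor are all assumptions chosen for illustration, and the learnable gain and bias of a full LN layer are omitted. It only shows that normalizing a single feature vector over its own dimensions gives essentially the same output when the input is multiplied by a large positive constant.

```python
import math


def layer_norm(x, eps=1e-12):
    """Normalize one feature vector to zero mean and unit variance.

    Hypothetical helper for illustration; no learnable gain or bias.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


# Illustrative feature vector (assumed values, not from the paper).
features = [0.5, -1.2, 3.0, 0.7]
# The same vector with an unexpectedly large magnitude.
scaled = [1000.0 * v for v in features]

out_small = layer_norm(features)
out_large = layer_norm(scaled)

# Largest element-wise difference between the two normalized outputs.
max_gap = max(abs(a - b) for a, b in zip(out_small, out_large))
print(max_gap)
```

Because the statistics are computed per sample rather than across a batch, the rescaling cancels out in the ratio, which is the property the abstract refers to as scale independence.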

Original language	English
Article number	127040
Journal	Neurocomputing
Volume	567
DOI	https://doi.org/10.1016/j.neucom.2023.127040
Publication status	Published - 28 Jan 2024

Cite this

@article{70954a5579af483388c18be56f4426bc,

title = "How does Layer Normalization improve Batch Normalization in self-supervised sound source localization?",

abstract = "Self-supervised sound source localization is often challenged by unexpectedly large inputs and an incorrect direction of normalization in current solutions. A promising way to address this challenge is to avoid feature deformation by incorporating more effective normalization, which is the motivation of this study. Based on a mathematical derivation of the scale independence of Layer Normalization (LN), this work proposes a correspondence consolidation method to reinforce the audio–visual correspondence. By combining input feature normalization with an LN-based SimSiam predictor, joint gradient stabilization can be further achieved for more accurate sound source localization. Extensive experiments on the SoundNet-Flickr and VGG-Sound Source datasets verify superior performance compared with other state-of-the-art works.",

keywords = "Audio-visual, Batch Normalization, Layer Normalization, Sound source localization",

author = "Tianyu Liu and Peng Zhang and Wei Huang and Yufei Zha and Tao You and Yanning Zhang",

note = "Publisher Copyright: {\textcopyright} 2023 Elsevier B.V.",

year = "2024",

month = jan,

day = "28",

doi = "10.1016/j.neucom.2023.127040",

language = "English",

volume = "567",

journal = "Neurocomputing",