TY - GEN
T1 - Cyclic Learning for Binaural Audio Generation and Localization
AU - Li, Zhaojian
AU - Zhao, Bin
AU - Yuan, Yuan
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Binaural audio is obtained by simulating the biological structure of human ears, which plays an important role in artificial immersive spaces. A promising approach is to utilize mono audio and corresponding vision to synthesize binaural audio, thereby avoiding expensive binaural audio recording. However, most existing methods directly use the entire scene as a guide, ignoring the correspondence between sounds and sounding objects. In this paper, we advocate generating binaural audio using fine-grained raw waveform and object-level visual information as guidance. Specifically, we propose a Cyclic Locating-and-UPmixing (CLUP) framework that jointly learns visual sounding object localization and binaural audio generation. Visual sounding object localization establishes the correspondence between specific visual objects and sound modalities, which provides object-aware guidance to improve binaural generation performance. Meanwhile, the spatial information contained in the generated binaural audio can further improve the performance of sounding object localization. In this case, visual sounding object localization and binaural audio generation can achieve cyclic learning and benefit from each other. Experimental results demonstrate that, on the FAIR-Play benchmark dataset, our method significantly outperforms existing baselines across multiple evaluation metrics (STFT↓: 0.787 vs. 0.851, ENV↓: 0.128 vs. 0.134, WAV↓: 5.244 vs. 5.684, SNR↑: 7.546 vs. 7.044).
AB - Binaural audio is obtained by simulating the biological structure of human ears, which plays an important role in artificial immersive spaces. A promising approach is to utilize mono audio and corresponding vision to synthesize binaural audio, thereby avoiding expensive binaural audio recording. However, most existing methods directly use the entire scene as a guide, ignoring the correspondence between sounds and sounding objects. In this paper, we advocate generating binaural audio using fine-grained raw waveform and object-level visual information as guidance. Specifically, we propose a Cyclic Locating-and-UPmixing (CLUP) framework that jointly learns visual sounding object localization and binaural audio generation. Visual sounding object localization establishes the correspondence between specific visual objects and sound modalities, which provides object-aware guidance to improve binaural generation performance. Meanwhile, the spatial information contained in the generated binaural audio can further improve the performance of sounding object localization. In this case, visual sounding object localization and binaural audio generation can achieve cyclic learning and benefit from each other. Experimental results demonstrate that, on the FAIR-Play benchmark dataset, our method significantly outperforms existing baselines across multiple evaluation metrics (STFT↓: 0.787 vs. 0.851, ENV↓: 0.128 vs. 0.134, WAV↓: 5.244 vs. 5.684, SNR↑: 7.546 vs. 7.044).
UR - http://www.scopus.com/inward/record.url?scp=85207244598&partnerID=8YFLogxK
U2 - 10.1109/CVPR52733.2024.02518
DO - 10.1109/CVPR52733.2024.02518
M3 - Conference contribution
AN - SCOPUS:85207244598
T3 - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
SP - 26659
EP - 26668
BT - Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
PB - IEEE Computer Society
T2 - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
Y2 - 16 June 2024 through 22 June 2024
ER -