TY - GEN
T1 - Cyclic Learning for Binaural Audio Generation and Localization
AU - Li, Zhaojian
AU - Zhao, Bin
AU - Yuan, Yuan
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Binaural audio is obtained by simulating the biological structure of human ears, which plays an important role in artificial immersive spaces. A promising approach is to utilize mono audio and corresponding vision to synthesize binaural audio, thereby avoiding expensive binaural audio recording. However, most existing methods directly use the entire scene as a guide, ignoring the correspondence between sounds and sounding objects. In this paper, we advocate generating binaural audio using fine-grained raw waveform and object-level visual information as guidance. Specifically, we propose a Cyclic Locating-and-UPmixing (CLUP) framework that jointly learns visual sounding object localization and binaural audio generation. Visual sounding object localization establishes the correspondence between specific visual objects and sound modalities, which provides object-aware guidance to improve binaural generation performance. Meanwhile, the spatial information contained in the generated binaural audio can further improve the performance of sounding object localization. In this case, visual sounding object localization and binaural audio generation can achieve cyclic learning and benefit from each other. Experimental results demonstrate that, on the FAIR-Play benchmark dataset, our method significantly outperforms existing baselines across multiple evaluation metrics (STFT↓: 0.787 vs. 0.851, ENV↓: 0.128 vs. 0.134, WAV↓: 5.244 vs. 5.684, SNR↑: 7.546 vs. 7.044).
AB - Binaural audio is obtained by simulating the biological structure of human ears, which plays an important role in artificial immersive spaces. A promising approach is to utilize mono audio and corresponding vision to synthesize binaural audio, thereby avoiding expensive binaural audio recording. However, most existing methods directly use the entire scene as a guide, ignoring the correspondence between sounds and sounding objects. In this paper, we advocate generating binaural audio using fine-grained raw waveform and object-level visual information as guidance. Specifically, we propose a Cyclic Locating-and-UPmixing (CLUP) framework that jointly learns visual sounding object localization and binaural audio generation. Visual sounding object localization establishes the correspondence between specific visual objects and sound modalities, which provides object-aware guidance to improve binaural generation performance. Meanwhile, the spatial information contained in the generated binaural audio can further improve the performance of sounding object localization. In this case, visual sounding object localization and binaural audio generation can achieve cyclic learning and benefit from each other. Experimental results demonstrate that, on the FAIR-Play benchmark dataset, our method significantly outperforms existing baselines across multiple evaluation metrics (STFT↓: 0.787 vs. 0.851, ENV↓: 0.128 vs. 0.134, WAV↓: 5.244 vs. 5.684, SNR↑: 7.546 vs. 7.044).
UR - http://www.scopus.com/inward/record.url?scp=85207244598&partnerID=8YFLogxK
U2 - 10.1109/CVPR52733.2024.02518
DO - 10.1109/CVPR52733.2024.02518
M3 - Conference contribution
AN - SCOPUS:85207244598
T3 - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
SP - 26659
EP - 26668
BT - Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
PB - IEEE Computer Society
T2 - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
Y2 - 16 June 2024 through 22 June 2024
ER -