MBTFNET: Multi-Band Temporal-Frequency Neural Network for Singing Voice Enhancement

Weiming Xu; Zhouxuan Chen; Zhili Tan; Shubo Lv; Runduo Han; Wenjiang Zhou; Weifeng Zhao; Lei Xie

doi:10.1109/ASRU57964.2023.10389670

School of Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

3 Scopus citations

Abstract

A typical neural speech enhancement (SE) approach mainly handles mixtures of speech and noise, which is not optimal for singing voice enhancement, where singing is often mixed with vocal-correlated accompaniment and differs substantially from speech. Music source separation (MSS) models treat vocals and the various accompaniment components equally, which may reduce performance compared with a model that considers only vocal enhancement. In this paper, we propose a novel multi-band temporal-frequency neural network (MBTFNet) for singing voice enhancement, which specifically removes background music, noise, and even backing vocals from singing recordings. MBTFNet combines inter-band and intra-band modeling for better processing of full-band signals. Dual-path modeling along the temporal and frequency axes and temporal dilation blocks are introduced to expand the receptive field of the model. For removing backing vocals in particular, we propose an implicit personalized enhancement (IPE) stage based on signal-to-noise ratio (SNR) estimation, which further improves the performance of MBTFNet. Experiments show that our proposed model significantly outperforms several state-of-the-art SE and MSS models.

Original language	English
Title of host publication	2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023
Publisher	Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic)	9798350306897
DOIs	https://doi.org/10.1109/ASRU57964.2023.10389670
State	Published - 2023
Event	2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023 - Taipei, Taiwan, Province of China; Duration: 16 Dec 2023 → 20 Dec 2023

Publication series

Name	2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023

Conference

Conference	2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023
Country/Territory	Taiwan, Province of China
City	Taipei
Period	16/12/23 → 20/12/23

Keywords

implicit personalized enhancement
MBTFNet
singing-voice enhancement

Access to Document

10.1109/ASRU57964.2023.10389670

Cite this

Xu, W., Chen, Z., Tan, Z., Lv, S., Han, R., Zhou, W., Zhao, W., & Xie, L. (2023). MBTFNET: Multi-Band Temporal-Frequency Neural Network for Singing Voice Enhancement. In 2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023 (2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ASRU57964.2023.10389670

@inproceedings{703ce0c19a364e01ba692dc2e8b77adc,

title = "MBTFNET: Multi-Band Temporal-Frequency Neural Network for Singing Voice Enhancement",

abstract = "A typical neural speech enhancement (SE) approach mainly handles mixtures of speech and noise, which is not optimal for singing voice enhancement, where singing is often mixed with vocal-correlated accompaniment and differs substantially from speech. Music source separation (MSS) models treat vocals and the various accompaniment components equally, which may reduce performance compared with a model that considers only vocal enhancement. In this paper, we propose a novel multi-band temporal-frequency neural network (MBTFNet) for singing voice enhancement, which specifically removes background music, noise, and even backing vocals from singing recordings. MBTFNet combines inter-band and intra-band modeling for better processing of full-band signals. Dual-path modeling along the temporal and frequency axes and temporal dilation blocks are introduced to expand the receptive field of the model. For removing backing vocals in particular, we propose an implicit personalized enhancement (IPE) stage based on signal-to-noise ratio (SNR) estimation, which further improves the performance of MBTFNet. Experiments show that our proposed model significantly outperforms several state-of-the-art SE and MSS models.",

keywords = "implicit personalized enhancement, MBTFNet, singing-voice enhancement",

author = "Weiming Xu and Zhouxuan Chen and Zhili Tan and Shubo Lv and Runduo Han and Wenjiang Zhou and Weifeng Zhao and Lei Xie",

note = "Publisher Copyright: {\textcopyright} 2023 IEEE.; 2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023 ; Conference date: 16-12-2023 Through 20-12-2023",

year = "2023",

doi = "10.1109/ASRU57964.2023.10389670",

language = "English",

series = "2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

booktitle = "2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023",

}

Xu, W, Chen, Z, Tan, Z, Lv, S, Han, R, Zhou, W, Zhao, W & Xie, L 2023, MBTFNET: Multi-Band Temporal-Frequency Neural Network for Singing Voice Enhancement. in 2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023. 2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023, Institute of Electrical and Electronics Engineers Inc., 2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023, Taipei, Taiwan, Province of China, 16/12/23. https://doi.org/10.1109/ASRU57964.2023.10389670

MBTFNET: Multi-Band Temporal-Frequency Neural Network for Singing Voice Enhancement. / Xu, Weiming; Chen, Zhouxuan; Tan, Zhili et al.
2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023. Institute of Electrical and Electronics Engineers Inc., 2023. (2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023).

TY - GEN

T1 - MBTFNET: Multi-Band Temporal-Frequency Neural Network for Singing Voice Enhancement

T2 - 2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023

AU - Xu, Weiming

AU - Chen, Zhouxuan

AU - Tan, Zhili

AU - Lv, Shubo

AU - Han, Runduo

AU - Zhou, Wenjiang

AU - Zhao, Weifeng

AU - Xie, Lei

PY - 2023

Y1 - 2023

AB - A typical neural speech enhancement (SE) approach mainly handles mixtures of speech and noise, which is not optimal for singing voice enhancement, where singing is often mixed with vocal-correlated accompaniment and differs substantially from speech. Music source separation (MSS) models treat vocals and the various accompaniment components equally, which may reduce performance compared with a model that considers only vocal enhancement. In this paper, we propose a novel multi-band temporal-frequency neural network (MBTFNet) for singing voice enhancement, which specifically removes background music, noise, and even backing vocals from singing recordings. MBTFNet combines inter-band and intra-band modeling for better processing of full-band signals. Dual-path modeling along the temporal and frequency axes and temporal dilation blocks are introduced to expand the receptive field of the model. For removing backing vocals in particular, we propose an implicit personalized enhancement (IPE) stage based on signal-to-noise ratio (SNR) estimation, which further improves the performance of MBTFNet. Experiments show that our proposed model significantly outperforms several state-of-the-art SE and MSS models.

KW - implicit personalized enhancement

KW - MBTFNet

KW - singing-voice enhancement

UR - http://www.scopus.com/inward/record.url?scp=85184664663&partnerID=8YFLogxK

U2 - 10.1109/ASRU57964.2023.10389670

DO - 10.1109/ASRU57964.2023.10389670

M3 - Conference contribution

AN - SCOPUS:85184664663

T3 - 2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023

BT - 2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023

PB - Institute of Electrical and Electronics Engineers Inc.

Y2 - 16 December 2023 through 20 December 2023

ER -

Xu W, Chen Z, Tan Z, Lv S, Han R, Zhou W et al. MBTFNET: Multi-Band Temporal-Frequency Neural Network for Singing Voice Enhancement. In 2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023. Institute of Electrical and Electronics Engineers Inc. 2023. (2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023). doi: 10.1109/ASRU57964.2023.10389670
