TY - GEN
T1 - Channel-wise subband input for better voice and accompaniment separation on high resolution music
AU - Liu, Haohe
AU - Xie, Lei
AU - Wu, Jian
AU - Yang, Geng
N1 - Publisher Copyright:
Copyright © 2020 ISCA
PY - 2020
Y1 - 2020
N2 - This paper presents a new input format, channel-wise subband input (CWS), for convolutional neural network (CNN)-based music source separation (MSS) models in the frequency domain. We aim to address the major issues in CNN-based high-resolution MSS models: high computational cost and weight sharing between distinctly different bands. Specifically, we decompose the input mixture spectra into several bands and concatenate them channel-wise as the model input. The proposed approach enables effective weight sharing within each subband and introduces more flexibility between channels. For comparison purposes, we perform voice and accompaniment separation (VAS) on models with different scales, architectures, and CWS settings. Experiments show that the CWS input is beneficial in many aspects. We evaluate our method on the musdb18hq test set, focusing on the SDR, SIR, and SAR metrics. Across all our experiments, CWS enables models to obtain a 6.9% performance gain on the average of these metrics. Even with fewer parameters, less training data, and a shorter training time, our MDenseNet with 8-band CWS input still surpasses the original MMDenseNet by a large margin. Moreover, CWS also reduces computational cost and training time to a large extent.
AB - This paper presents a new input format, channel-wise subband input (CWS), for convolutional neural network (CNN)-based music source separation (MSS) models in the frequency domain. We aim to address the major issues in CNN-based high-resolution MSS models: high computational cost and weight sharing between distinctly different bands. Specifically, we decompose the input mixture spectra into several bands and concatenate them channel-wise as the model input. The proposed approach enables effective weight sharing within each subband and introduces more flexibility between channels. For comparison purposes, we perform voice and accompaniment separation (VAS) on models with different scales, architectures, and CWS settings. Experiments show that the CWS input is beneficial in many aspects. We evaluate our method on the musdb18hq test set, focusing on the SDR, SIR, and SAR metrics. Across all our experiments, CWS enables models to obtain a 6.9% performance gain on the average of these metrics. Even with fewer parameters, less training data, and a shorter training time, our MDenseNet with 8-band CWS input still surpasses the original MMDenseNet by a large margin. Moreover, CWS also reduces computational cost and training time to a large extent.
KW - Deep learning
KW - Music source separation
KW - Subband
KW - Voice and accompaniment separation
UR - http://www.scopus.com/inward/record.url?scp=85098163411&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2020-2555
DO - 10.21437/Interspeech.2020-2555
M3 - Conference contribution
AN - SCOPUS:85098163411
SN - 9781713820697
T3 - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SP - 1241
EP - 1245
BT - Interspeech 2020
PB - International Speech Communication Association
T2 - 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020
Y2 - 25 October 2020 through 29 October 2020
ER -
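
The abstract above describes the core CWS operation only in prose. Below is a minimal, illustrative Python sketch of that idea, not the authors' implementation; the band count, tensor shapes, and function names are assumptions chosen for illustration. A mixture magnitude spectrogram of shape (channels, frequency bins, time frames) is split into equal-width frequency bands, which are stacked along the channel axis, and an inverse routine undoes the packing.

# Illustrative sketch of channel-wise subband (CWS) input, as described in the
# abstract: split the frequency axis of a spectrogram into several bands and
# concatenate the bands along the channel axis. Shapes and the default band
# count are assumptions for demonstration, not the paper's exact configuration.
import numpy as np

def channel_wise_subband(spec: np.ndarray, num_bands: int = 8) -> np.ndarray:
    """Reshape (channels, freq, time) -> (channels * num_bands, freq // num_bands, time)."""
    channels, freq, time = spec.shape
    assert freq % num_bands == 0, "frequency bins must divide evenly into bands"
    band_size = freq // num_bands
    # Slice the frequency axis into contiguous subbands ...
    bands = [spec[:, b * band_size:(b + 1) * band_size, :] for b in range(num_bands)]
    # ... and stack them along the channel axis.
    return np.concatenate(bands, axis=0)

def inverse_channel_wise_subband(subband_spec: np.ndarray, num_bands: int = 8) -> np.ndarray:
    """Undo the CWS packing: (channels * num_bands, freq // num_bands, time) -> (channels, freq, time)."""
    stacked_channels, band_size, time = subband_spec.shape
    channels = stacked_channels // num_bands
    # Recover each subband from the channel axis and re-join along frequency.
    bands = [subband_spec[b * channels:(b + 1) * channels, :, :] for b in range(num_bands)]
    return np.concatenate(bands, axis=1)

if __name__ == "__main__":
    # Example: a stereo magnitude spectrogram with 2048 frequency bins and 256 frames.
    mixture = np.random.rand(2, 2048, 256).astype(np.float32)
    cws_input = channel_wise_subband(mixture, num_bands=8)       # shape (16, 256, 256)
    restored = inverse_channel_wise_subband(cws_input, num_bands=8)
    print(cws_input.shape, np.allclose(restored, mixture))

Consistent with the abstract, this packing lets the CNN share convolution kernels within each subband rather than across the full spectrum, and it shrinks the frequency dimension of each channel by the band count, which is where the reported reductions in computational cost and training time come from.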