Channel-wise subband input for better voice and accompaniment separation on high resolution music

Haohe Liu; Lei Xie; Jian Wu; Geng Yang

doi:10.21437/Interspeech.2020-2555

School of Computer Science

Northwestern Polytechnical University Xian

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

17 Citations (Scopus)

Abstract

This paper presents a new input format, channel-wise subband input (CWS), for convolutional neural networks (CNN) based music source separation (MSS) models in the frequency domain. We aim to address the major issues in CNN-based high-resolution MSS model: high computational cost and weight sharing between distinctly different bands. Specifically, in this paper, we decompose the input mixture spectra into several bands and concatenate them channel-wise as the model input. The proposed approach enables effective weight sharing in each subband and introduces more flexibility between channels. For comparison purposes, we perform voice and accompaniment separation (VAS) on models with different scales, architectures, and CWS settings. Experiments show that the CWS input is beneficial in many aspects. We evaluate our method on musdb18hq test set, focusing on SDR, SIR and SAR metrics. Among all our experiments, CWS enables models to obtain 6.9% performance gain on the average metrics. With even a smaller number of parameters, less training data, and shorter training time, our MDenseNet with 8-bands CWS input still surpasses the original MMDenseNet with a large margin. Moreover, CWS also reduces computational cost and training time to a large extent.
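
The decomposition described in the abstract can be illustrated with a minimal sketch. It assumes a plain equal-width split of the spectrogram along the frequency axis, with the resulting bands stacked on the channel axis; the function names `to_channelwise_subbands` and `from_channelwise_subbands` are hypothetical, and the paper's actual band-splitting procedure may differ. It also assumes the number of frequency bins is divisible by the number of bands (for an STFT with an odd bin count, one bin would have to be dropped or padded first).

```python
import numpy as np


def to_channelwise_subbands(spec: np.ndarray, n_bands: int) -> np.ndarray:
    """Split a spectrogram into equal-width bands and stack them as channels.

    spec has shape (channels, freq_bins, frames). The result has shape
    (channels * n_bands, freq_bins // n_bands, frames).
    Assumption: equal-width, non-overlapping bands (illustrative only).
    """
    channels, freq_bins, frames = spec.shape
    if freq_bins % n_bands != 0:
        raise ValueError("freq_bins must be divisible by n_bands")
    band_width = freq_bins // n_bands
    # (C, F, T) -> (C, n_bands, band_width, T): consecutive bins form one band.
    bands = spec.reshape(channels, n_bands, band_width, frames)
    # Fold the band axis into the channel axis: (C * n_bands, band_width, T).
    return bands.reshape(channels * n_bands, band_width, frames)


def from_channelwise_subbands(cws: np.ndarray, n_bands: int) -> np.ndarray:
    """Undo to_channelwise_subbands, restoring shape (channels, freq_bins, frames)."""
    stacked, band_width, frames = cws.shape
    if stacked % n_bands != 0:
        raise ValueError("channel count must be divisible by n_bands")
    channels = stacked // n_bands
    return cws.reshape(channels, n_bands * band_width, frames)


if __name__ == "__main__":
    # Stereo magnitude spectrogram with 1024 bins and 100 frames (made-up sizes).
    mixture = np.random.rand(2, 1024, 100)
    cws_input = to_channelwise_subbands(mixture, n_bands=8)
    print(cws_input.shape)
```

Under these assumptions, an 8-band split turns a 2-channel, 1024-bin input into a 16-channel, 128-bin input, so each convolution kernel slides over a frequency axis one eighth as long, which is in line with the reduced computational cost the abstract reports.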

Original language	English
Title of host publication	Interspeech 2020
Publisher	International Speech Communication Association
Pages	1241-1245
Number of pages	5
ISBN (Print)	9781713820697
DOI	https://doi.org/10.21437/Interspeech.2020-2555
Publication status	Published - 2020
Event	21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020 - Shanghai, China; Duration: 25 Oct 2020 → 29 Oct 2020

Publication series

Name	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume	2020-October
ISSN (Print)	2308-457X
ISSN (Electronic)	1990-9772

Conference

Conference	21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020
Country/Territory	China
City	Shanghai
Period	25/10/20 → 29/10/20

Cite this

Liu, H., Xie, L., Wu, J., & Yang, G. (2020). Channel-wise subband input for better voice and accompaniment separation on high resolution music. In Interspeech 2020 (pp. 1241-1245). (Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH; Vol. 2020-October). International Speech Communication Association. https://doi.org/10.21437/Interspeech.2020-2555

@inproceedings{15ca3ff21e2b49ce9e73648aab2b2698,

title = "Channel-wise subband input for better voice and accompaniment separation on high resolution music",

abstract = "This paper presents a new input format, channel-wise subband input (CWS), for convolutional neural networks (CNN) based music source separation (MSS) models in the frequency domain. We aim to address the major issues in CNN-based high-resolution MSS model: high computational cost and weight sharing between distinctly different bands. Specifically, in this paper, we decompose the input mixture spectra into several bands and concatenate them channel-wise as the model input. The proposed approach enables effective weight sharing in each subband and introduces more flexibility between channels. For comparison purposes, we perform voice and accompaniment separation (VAS) on models with different scales, architectures, and CWS settings. Experiments show that the CWS input is beneficial in many aspects. We evaluate our method on musdb18hq test set, focusing on SDR, SIR and SAR metrics. Among all our experiments, CWS enables models to obtain 6.9% performance gain on the average metrics. With even a smaller number of parameters, less training data, and shorter training time, our MDenseNet with 8-bands CWS input still surpasses the original MMDenseNet with a large margin. Moreover, CWS also reduces computational cost and training time to a large extent.",

keywords = "Deep learning, Music source separation, Subband, Voice and accompaniment separation",

author = "Haohe Liu and Lei Xie and Jian Wu and Geng Yang",

note = "Publisher Copyright: Copyright {\textcopyright} 2020 ISCA; 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020 ; Conference date: 25-10-2020 Through 29-10-2020",

year = "2020",

doi = "10.21437/Interspeech.2020-2555",

language = "English",

isbn = "9781713820697",

series = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",

publisher = "International Speech Communication Association",

pages = "1241--1245",

booktitle = "Interspeech 2020",

}

Liu, H, Xie, L, Wu, J & Yang, G 2020, Channel-wise subband input for better voice and accompaniment separation on high resolution music. in Interspeech 2020. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 2020-October, International Speech Communication Association, pp. 1241-1245, 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020, Shanghai, China, 25/10/20. https://doi.org/10.21437/Interspeech.2020-2555

Channel-wise subband input for better voice and accompaniment separation on high resolution music. / Liu, Haohe; Xie, Lei; Wu, Jian et al.
Interspeech 2020. International Speech Communication Association, 2020. p. 1241-1245 (Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH; Vol. 2020-October).

TY - GEN

T1 - Channel-wise subband input for better voice and accompaniment separation on high resolution music

AU - Liu, Haohe

AU - Xie, Lei

AU - Wu, Jian

AU - Yang, Geng

PY - 2020

Y1 - 2020

AB - This paper presents a new input format, channel-wise subband input (CWS), for convolutional neural networks (CNN) based music source separation (MSS) models in the frequency domain. We aim to address the major issues in CNN-based high-resolution MSS model: high computational cost and weight sharing between distinctly different bands. Specifically, in this paper, we decompose the input mixture spectra into several bands and concatenate them channel-wise as the model input. The proposed approach enables effective weight sharing in each subband and introduces more flexibility between channels. For comparison purposes, we perform voice and accompaniment separation (VAS) on models with different scales, architectures, and CWS settings. Experiments show that the CWS input is beneficial in many aspects. We evaluate our method on musdb18hq test set, focusing on SDR, SIR and SAR metrics. Among all our experiments, CWS enables models to obtain 6.9% performance gain on the average metrics. With even a smaller number of parameters, less training data, and shorter training time, our MDenseNet with 8-bands CWS input still surpasses the original MMDenseNet with a large margin. Moreover, CWS also reduces computational cost and training time to a large extent.

KW - Deep learning

KW - Music source separation

KW - Subband

KW - Voice and accompaniment separation

UR - http://www.scopus.com/inward/record.url?scp=85098163411&partnerID=8YFLogxK

U2 - 10.21437/Interspeech.2020-2555

DO - 10.21437/Interspeech.2020-2555

M3 - Conference contribution

AN - SCOPUS:85098163411

SN - 9781713820697

T3 - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

SP - 1241

EP - 1245

BT - Interspeech 2020

PB - International Speech Communication Association

T2 - 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020

Y2 - 25 October 2020 through 29 October 2020

ER -
