DCCRN: Deep complex convolution recurrent network for phase-aware speech enhancement

Yanxin Hu; Yun Liu; Shubo Lv; Mengtao Xing; Shimin Zhang; Yihui Fu; Jian Wu; Bihong Zhang; Lei Xie

doi:10.21437/Interspeech.2020-2537

DCCRN: Deep complex convolution recurrent network for phase-aware speech enhancement

Yanxin Hu, Yun Liu, Shubo Lv, Mengtao Xing, Shimin Zhang, Yihui Fu, Jian Wu, Bihong Zhang, Lei Xie

School of Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

421 Scopus citations

Abstract

Speech enhancement has benefited from the success of deep learning in terms of intelligibility and perceptual quality. Conventional time-frequency (TF) domain methods focus on predicting TF-masks or speech spectrum, via a naive convolution neural network (CNN) or recurrent neural network (RNN). Some recent studies use complex-valued spectrogram as a training target but train in a real-valued network, predicting the magnitude and phase component or real and imaginary part, respectively. Particularly, convolution recurrent network (CRN) integrates a convolutional encoder-decoder (CED) structure and long short-term memory (LSTM), which has been proven to be helpful for complex targets. In order to train the complex target more effectively, in this paper, we design a new network structure simulating the complex-valued operation, called Deep Complex Convolution Recurrent Network (DCCRN), where both CNN and RNN structures can handle complex-valued operation. The proposed DCCRN models are very competitive over other previous networks, either on objective or subjective metric. With only 3.7M parameters, our DCCRN models submitted to the Interspeech 2020 Deep Noise Suppression (DNS) challenge ranked first for the real-time-track and second for the non-real-time track in terms of Mean Opinion Score (MOS).

Original language	English
Title of host publication	Interspeech 2020
Publisher	International Speech Communication Association
Pages	2472-2476
Number of pages	5
ISBN (Print)	9781713820697
DOIs	https://doi.org/10.21437/Interspeech.2020-2537
State	Published - 2020
Event	21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020 - Shanghai, China Duration: 25 Oct 2020 → 29 Oct 2020

Publication series

Name	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume	2020-October
ISSN (Print)	2308-457X
ISSN (Electronic)	1990-9772

Conference

Conference	21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020
Country/Territory	China
City	Shanghai
Period	25/10/20 → 29/10/20

Keywords

Complex network
Deep learning
Denoise
Speech enhancement

Access to Document

10.21437/Interspeech.2020-2537

Cite this

Hu, Y., Liu, Y., Lv, S., Xing, M., Zhang, S., Fu, Y., Wu, J., Zhang, B., & Xie, L. (2020). DCCRN: Deep complex convolution recurrent network for phase-aware speech enhancement. In Interspeech 2020 (pp. 2472-2476). (Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH; Vol. 2020-October). International Speech Communication Association. https://doi.org/10.21437/Interspeech.2020-2537

@inproceedings{8ee93c3d14de4a0ea5d0d3f674254f5a,

title = "DCCRN: Deep complex convolution recurrent network for phase-aware speech enhancement",

abstract = "Speech enhancement has benefited from the success of deep learning in terms of intelligibility and perceptual quality. Conventional time-frequency (TF) domain methods focus on predicting TF-masks or speech spectrum, via a naive convolution neural network (CNN) or recurrent neural network (RNN). Some recent studies use complex-valued spectrogram as a training target but train in a real-valued network, predicting the magnitude and phase component or real and imaginary part, respectively. Particularly, convolution recurrent network (CRN) integrates a convolutional encoder-decoder (CED) structure and long short-term memory (LSTM), which has been proven to be helpful for complex targets. In order to train the complex target more effectively, in this paper, we design a new network structure simulating the complex-valued operation, called Deep Complex Convolution Recurrent Network (DCCRN), where both CNN and RNN structures can handle complex-valued operation. The proposed DCCRN models are very competitive over other previous networks, either on objective or subjective metric. With only 3.7M parameters, our DCCRN models submitted to the Interspeech 2020 Deep Noise Suppression (DNS) challenge ranked first for the real-time-track and second for the non-real-time track in terms of Mean Opinion Score (MOS).",

keywords = "Complex network, Deep learning, Denoise, Speech enhancement",

author = "Yanxin Hu and Yun Liu and Shubo Lv and Mengtao Xing and Shimin Zhang and Yihui Fu and Jian Wu and Bihong Zhang and Lei Xie",

note = "Publisher Copyright: Copyright {\textcopyright} 2020 ISCA; 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020 ; Conference date: 25-10-2020 Through 29-10-2020",

year = "2020",

doi = "10.21437/Interspeech.2020-2537",

language = "英语",

isbn = "9781713820697",

series = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",

publisher = "International Speech Communication Association",

pages = "2472--2476",

booktitle = "Interspeech 2020",

}

Hu, Y, Liu, Y, Lv, S, Xing, M, Zhang, S, Fu, Y, Wu, J, Zhang, B & Xie, L 2020, DCCRN: Deep complex convolution recurrent network for phase-aware speech enhancement. in Interspeech 2020. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 2020-October, International Speech Communication Association, pp. 2472-2476, 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020, Shanghai, China, 25/10/20. https://doi.org/10.21437/Interspeech.2020-2537

DCCRN: Deep complex convolution recurrent network for phase-aware speech enhancement. / Hu, Yanxin; Liu, Yun; Lv, Shubo et al.
Interspeech 2020. International Speech Communication Association, 2020. p. 2472-2476 (Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH; Vol. 2020-October).

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

TY - GEN

T1 - DCCRN

T2 - 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020

AU - Hu, Yanxin

AU - Liu, Yun

AU - Lv, Shubo

AU - Xing, Mengtao

AU - Zhang, Shimin

AU - Fu, Yihui

AU - Wu, Jian

AU - Zhang, Bihong

AU - Xie, Lei

PY - 2020

Y1 - 2020

N2 - Speech enhancement has benefited from the success of deep learning in terms of intelligibility and perceptual quality. Conventional time-frequency (TF) domain methods focus on predicting TF-masks or speech spectrum, via a naive convolution neural network (CNN) or recurrent neural network (RNN). Some recent studies use complex-valued spectrogram as a training target but train in a real-valued network, predicting the magnitude and phase component or real and imaginary part, respectively. Particularly, convolution recurrent network (CRN) integrates a convolutional encoder-decoder (CED) structure and long short-term memory (LSTM), which has been proven to be helpful for complex targets. In order to train the complex target more effectively, in this paper, we design a new network structure simulating the complex-valued operation, called Deep Complex Convolution Recurrent Network (DCCRN), where both CNN and RNN structures can handle complex-valued operation. The proposed DCCRN models are very competitive over other previous networks, either on objective or subjective metric. With only 3.7M parameters, our DCCRN models submitted to the Interspeech 2020 Deep Noise Suppression (DNS) challenge ranked first for the real-time-track and second for the non-real-time track in terms of Mean Opinion Score (MOS).

AB - Speech enhancement has benefited from the success of deep learning in terms of intelligibility and perceptual quality. Conventional time-frequency (TF) domain methods focus on predicting TF-masks or speech spectrum, via a naive convolution neural network (CNN) or recurrent neural network (RNN). Some recent studies use complex-valued spectrogram as a training target but train in a real-valued network, predicting the magnitude and phase component or real and imaginary part, respectively. Particularly, convolution recurrent network (CRN) integrates a convolutional encoder-decoder (CED) structure and long short-term memory (LSTM), which has been proven to be helpful for complex targets. In order to train the complex target more effectively, in this paper, we design a new network structure simulating the complex-valued operation, called Deep Complex Convolution Recurrent Network (DCCRN), where both CNN and RNN structures can handle complex-valued operation. The proposed DCCRN models are very competitive over other previous networks, either on objective or subjective metric. With only 3.7M parameters, our DCCRN models submitted to the Interspeech 2020 Deep Noise Suppression (DNS) challenge ranked first for the real-time-track and second for the non-real-time track in terms of Mean Opinion Score (MOS).

KW - Complex network

KW - Deep learning

KW - Denoise

KW - Speech enhancement

UR - http://www.scopus.com/inward/record.url?scp=85095506401&partnerID=8YFLogxK

U2 - 10.21437/Interspeech.2020-2537

DO - 10.21437/Interspeech.2020-2537

M3 - 会议稿件

AN - SCOPUS:85095506401

SN - 9781713820697

T3 - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

SP - 2472

EP - 2476

BT - Interspeech 2020

PB - International Speech Communication Association

Y2 - 25 October 2020 through 29 October 2020

ER -

DCCRN: Deep complex convolution recurrent network for phase-aware speech enhancement

Abstract

Publication series

Conference

Keywords

Access to Document

Other files and links

Fingerprint

Cite this