Multi-Channel Automatic Speech Recognition Using Deep Complex Unet

Yuxiang Kong; Jian Wu; Quandong Wang; Peng Gao; Weiji Zhuang; Yujun Wang; Lei Xie

doi:10.1109/SLT48900.2021.9383492

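School of Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

10 Citations (Scopus)

Abstract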

The front-end module in multi-channel automatic speech recognition (ASR) systems mainly uses microphone array techniques to produce enhanced signals in noisy conditions with reverberation and echoes. Recently, neural network (NN) based front-ends have shown promising improvement over conventional signal processing methods. In this paper, we propose to adopt the architecture of deep complex Unet (DCUnet) - a powerful complex-valued Unet-structured speech enhancement model - as the front-end of the multi-channel acoustic model, and integrate them in a multi-task learning (MTL) framework, along with a cascaded framework for comparison. Meanwhile, we investigate the proposed methods with several training strategies to improve the recognition accuracy on the 1000-hour real-world XiaoMi smart speaker data with echoes. Experiments show that our proposed DCUnet-MTL method brings about 12.2% relative character error rate (CER) reduction compared with the traditional approach of array processing plus a single-channel acoustic model. It also outperforms the recently proposed neural beamforming method.

Original language	English
Title of host publication	2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Proceedings
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	104-110
Number of pages	7
ISBN (Electronic)	9781728170664
DOI	https://doi.org/10.1109/SLT48900.2021.9383492
Publication status	Published - 19 Jan 2021
Event	2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Virtual, Shenzhen, China; Duration: 19 Jan 2021 → 22 Jan 2021

Publication series

Name	2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Proceedings

Conference

Conference	2021 IEEE Spoken Language Technology Workshop, SLT 2021
Country/Territory	China
City	Virtual, Shenzhen
Period	19/01/21 → 22/01/21

Access to Document

10.1109/SLT48900.2021.9383492

Other files and links

Link to publication in Scopus

Cite this

Kong, Y., Wu, J., Wang, Q., Gao, P., Zhuang, W., Wang, Y., & Xie, L. (2021). Multi-Channel Automatic Speech Recognition Using Deep Complex Unet. In 2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Proceedings (pp. 104-110). Article 9383492 (2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Proceedings). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/SLT48900.2021.9383492

@inproceedings{6fbd8ead60d642859159c3e6d008bcf4,

title = "Multi-Channel Automatic Speech Recognition Using Deep Complex Unet",

abstract = "The front-end module in multi-channel automatic speech recognition (ASR) systems mainly uses microphone array techniques to produce enhanced signals in noisy conditions with reverberation and echoes. Recently, neural network (NN) based front-ends have shown promising improvement over conventional signal processing methods. In this paper, we propose to adopt the architecture of deep complex Unet (DCUnet) - a powerful complex-valued Unet-structured speech enhancement model - as the front-end of the multi-channel acoustic model, and integrate them in a multi-task learning (MTL) framework, along with a cascaded framework for comparison. Meanwhile, we investigate the proposed methods with several training strategies to improve the recognition accuracy on the 1000-hour real-world XiaoMi smart speaker data with echoes. Experiments show that our proposed DCUnet-MTL method brings about 12.2\% relative character error rate (CER) reduction compared with the traditional approach of array processing plus a single-channel acoustic model. It also outperforms the recently proposed neural beamforming method.",

keywords = "deep complex unet, deep learning, Multi-channel speech recognition, robust speech recognition",

author = "Yuxiang Kong and Jian Wu and Quandong Wang and Peng Gao and Weiji Zhuang and Yujun Wang and Lei Xie",

note = "Publisher Copyright: {\textcopyright} 2021 IEEE.; 2021 IEEE Spoken Language Technology Workshop, SLT 2021 ; Conference date: 19-01-2021 Through 22-01-2021",

year = "2021",

month = jan,

day = "19",

doi = "10.1109/SLT48900.2021.9383492",

language = "English",

series = "2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Proceedings",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

pages = "104--110",

booktitle = "2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Proceedings",

}

Kong, Y, Wu, J, Wang, Q, Gao, P, Zhuang, W, Wang, Y & Xie, L 2021, Multi-Channel Automatic Speech Recognition Using Deep Complex Unet. in 2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Proceedings, 9383492, 2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Proceedings, Institute of Electrical and Electronics Engineers Inc., pp. 104-110, 2021 IEEE Spoken Language Technology Workshop, SLT 2021, Virtual, Shenzhen, China, 19/01/21. https://doi.org/10.1109/SLT48900.2021.9383492

Multi-Channel Automatic Speech Recognition Using Deep Complex Unet. / Kong, Yuxiang; Wu, Jian; Wang, Quandong et al.
2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Proceedings. Institute of Electrical and Electronics Engineers Inc., 2021. p. 104-110 9383492 (2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Proceedings).

TY - GEN

T1 - Multi-Channel Automatic Speech Recognition Using Deep Complex Unet

AU - Kong, Yuxiang

AU - Wu, Jian

AU - Wang, Quandong

AU - Gao, Peng

AU - Zhuang, Weiji

AU - Wang, Yujun

AU - Xie, Lei

PY - 2021/1/19

Y1 - 2021/1/19

N2 - The front-end module in multi-channel automatic speech recognition (ASR) systems mainly uses microphone array techniques to produce enhanced signals in noisy conditions with reverberation and echoes. Recently, neural network (NN) based front-ends have shown promising improvement over conventional signal processing methods. In this paper, we propose to adopt the architecture of deep complex Unet (DCUnet) - a powerful complex-valued Unet-structured speech enhancement model - as the front-end of the multi-channel acoustic model, and integrate them in a multi-task learning (MTL) framework, along with a cascaded framework for comparison. Meanwhile, we investigate the proposed methods with several training strategies to improve the recognition accuracy on the 1000-hour real-world XiaoMi smart speaker data with echoes. Experiments show that our proposed DCUnet-MTL method brings about 12.2% relative character error rate (CER) reduction compared with the traditional approach of array processing plus a single-channel acoustic model. It also outperforms the recently proposed neural beamforming method.

AB - The front-end module in multi-channel automatic speech recognition (ASR) systems mainly uses microphone array techniques to produce enhanced signals in noisy conditions with reverberation and echoes. Recently, neural network (NN) based front-ends have shown promising improvement over conventional signal processing methods. In this paper, we propose to adopt the architecture of deep complex Unet (DCUnet) - a powerful complex-valued Unet-structured speech enhancement model - as the front-end of the multi-channel acoustic model, and integrate them in a multi-task learning (MTL) framework, along with a cascaded framework for comparison. Meanwhile, we investigate the proposed methods with several training strategies to improve the recognition accuracy on the 1000-hour real-world XiaoMi smart speaker data with echoes. Experiments show that our proposed DCUnet-MTL method brings about 12.2% relative character error rate (CER) reduction compared with the traditional approach of array processing plus a single-channel acoustic model. It also outperforms the recently proposed neural beamforming method.

KW - deep complex unet

KW - deep learning

KW - Multi-channel speech recognition

KW - robust speech recognition

UR - http://www.scopus.com/inward/record.url?scp=85103987969&partnerID=8YFLogxK

U2 - 10.1109/SLT48900.2021.9383492

DO - 10.1109/SLT48900.2021.9383492

M3 - Conference contribution

AN - SCOPUS:85103987969

T3 - 2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Proceedings

SP - 104

EP - 110

BT - 2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Proceedings

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - 2021 IEEE Spoken Language Technology Workshop, SLT 2021

Y2 - 19 January 2021 through 22 January 2021

ER -

Kong Y, Wu J, Wang Q, Gao P, Zhuang W, Wang Y et al. Multi-Channel Automatic Speech Recognition Using Deep Complex Unet. In 2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Proceedings. Institute of Electrical and Electronics Engineers Inc. 2021. p. 104-110. 9383492. (2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Proceedings). doi: 10.1109/SLT48900.2021.9383492
