TY - GEN
T1 - An end-to-end architecture of online multi-channel speech separation
AU - Wu, Jian
AU - Chen, Zhuo
AU - Li, Jinyu
AU - Yoshioka, Takuya
AU - Tan, Zhili
AU - Lin, Ed
AU - Luo, Yi
AU - Xie, Lei
N1 - Publisher Copyright:
Copyright © 2020 ISCA
PY - 2020
Y1 - 2020
AB - Multi-speaker speech recognition has been one of the key challenges in conversation transcription, as it breaks the single-active-speaker assumption employed by most state-of-the-art speech recognition systems. Speech separation is considered a remedy to this problem. Previously, we introduced a system called unmixing, fixed-beamformer and extraction (UFE), which was shown to be effective in addressing the speech overlap problem in conversation transcription. With UFE, an input mixed signal is processed by fixed beamformers, followed by neural network post-filtering. Although promising results were obtained, the system contains multiple individually developed modules, leading to potentially sub-optimal performance. In this work, we introduce an end-to-end modeling version of UFE. To enable gradient propagation all the way, an attentional selection module is proposed, where an attentional weight is learned for each beamformer and each spatial feature sampled over space. Experimental results show that the proposed system achieves performance comparable to the original pipeline of separately developed modules in an offline evaluation, while producing remarkable improvements in an online evaluation.
KW - Fixed beamformer
KW - Multi-channel speech separation
KW - Robust speech recognition
KW - Source localization
KW - Speaker extraction
UR - http://www.scopus.com/inward/record.url?scp=85098117401&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2020-1981
DO - 10.21437/Interspeech.2020-1981
M3 - Conference contribution
AN - SCOPUS:85098117401
SN - 9781713820697
T3 - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SP - 81
EP - 85
BT - Interspeech 2020
PB - International Speech Communication Association
T2 - 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020
Y2 - 25 October 2020 through 29 October 2020
ER -