TY - GEN
T1 - An end-to-end architecture of online multi-channel speech separation
AU - Wu, Jian
AU - Chen, Zhuo
AU - Li, Jinyu
AU - Yoshioka, Takuya
AU - Tan, Zhili
AU - Lin, Ed
AU - Luo, Yi
AU - Xie, Lei
N1 - Publisher Copyright:
Copyright © 2020 ISCA
PY - 2020
Y1 - 2020
AB - Multi-speaker speech recognition has been one of the key challenges in conversation transcription, as it breaks the single-active-speaker assumption employed by most state-of-the-art speech recognition systems. Speech separation is considered a remedy to this problem. Previously, we introduced a system called unmixing, fixed-beamformer and extraction (UFE), which was shown to be effective in addressing the speech overlap problem in conversation transcription. With UFE, an input mixed signal is processed by fixed beamformers, followed by neural network post-filtering. Although promising results were obtained, the system contains multiple individually developed modules, leading to potentially sub-optimal performance. In this work, we introduce an end-to-end modeling version of UFE. To enable gradient propagation all the way, an attentional selection module is proposed, where an attentional weight is learned for each beamformer and each spatial feature sampled over space. Experimental results show that the proposed system achieves performance comparable to the original pipeline of separately developed modules in an offline evaluation, while producing remarkable improvements in an online evaluation.
KW - Fixed beamformer
KW - Multi-channel speech separation
KW - Robust speech recognition
KW - Source localization
KW - Speaker extraction
UR - http://www.scopus.com/inward/record.url?scp=85098117401&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2020-1981
DO - 10.21437/Interspeech.2020-1981
M3 - Conference contribution
AN - SCOPUS:85098117401
SN - 9781713820697
T3 - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SP - 81
EP - 85
BT - Interspeech 2020
PB - International Speech Communication Association
T2 - 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020
Y2 - 25 October 2020 through 29 October 2020
ER -