TY - GEN
T1 - A speech enhancement system for automotive speech recognition with a hybrid voice activity detection method
AU - Wang, Haikun
AU - Ye, Zhongfu
AU - Chen, Jingdong
N1 - Publisher Copyright: © 2018 IEEE.
PY - 2018/11/2
Y1 - 2018/11/2
AB - This paper presents a front-end speech enhancement approach to robust speech recognition in automotive environments. It combines hybrid voice activity detection (VAD), relative transfer function (RTF) based generalized sidelobe cancelation, and single-channel post-filtering to enhance the speech signal of interest, thereby improving the robustness of speech recognition. First, we choose four typical driving scenarios, which cover most of the noise types in automobiles, to record training data. The recorded data is then used to train deep neural network (DNN) models for both speech and noise. The trained DNNs are subsequently used to estimate the speech presence probability on a frame-by-frame basis. This speech presence probability is then combined with the output of an energy-based VAD to form a hybrid VAD, which serves as the basis for the remaining components of the speech enhancement system, including RTF estimation, adaptive beamforming, and post-filtering. Experiments are conducted in real automotive environments. The results show that the developed method can significantly improve the performance of both VAD and automatic speech recognition (ASR).
KW - Deep neural network
KW - Microphone array
KW - Speech enhancement
KW - Speech recognition
KW - Voice activity detection
UR - http://www.scopus.com/inward/record.url?scp=85057420738&partnerID=8YFLogxK
U2 - 10.1109/IWAENC.2018.8521410
DO - 10.1109/IWAENC.2018.8521410
M3 - Conference contribution
AN - SCOPUS:85057420738
T3 - 16th International Workshop on Acoustic Signal Enhancement, IWAENC 2018 - Proceedings
SP - 456
EP - 460
BT - 16th International Workshop on Acoustic Signal Enhancement, IWAENC 2018 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 16th International Workshop on Acoustic Signal Enhancement, IWAENC 2018
Y2 - 17 September 2018 through 20 September 2018
ER -