TY - GEN
T1 - Robust Audio-Visual Speech Recognition Using Bimodal DFSMN with Multi-Condition Training and Dropout Regularization
AU - Zhang, Shiliang
AU - Lei, Ming
AU - Ma, Bin
AU - Xie, Lei
N1 - Publisher Copyright:
© 2019 IEEE.
PY - 2019/5
Y1 - 2019/5
AB - Audio-visual speech recognition (AVSR) is considered a promising approach to robust speech recognition, especially in noisy environments. Compared to audio-only speech recognition, the major issues of AVSR include the lack of publicly available audio-visual corpora and the need for robust knowledge fusion of speech and vision. In this work, based on the recently released NTCD-TIMIT audio-visual corpus, we address the challenges of AVSR through three aspects: 1) optimal integration of acoustic and visual information; 2) robust performance with multi-condition training; 3) robust modeling against missing visual information during decoding. We propose a bimodal DFSMN to jointly learn feature fusion and acoustic modeling, and utilize a per-frame dropout approach to enhance the robustness of the AVSR system against a missing visual modality. In the experiments, we construct two setups based on the NTCD-TIMIT corpus, consisting of 5 hours of clean training data and 150 hours of multi-condition training data, respectively. As a result, we achieve a phone error rate of 12.6% on the clean test set and an average phone error rate of 26.2% across all test sets (clean, various SNRs, various noise types), both of which dramatically improve on the baseline performance of the NTCD-TIMIT task.
KW - Audio-visual speech recognition
KW - bimodal DFSMN
KW - dropout
KW - multi-condition training
KW - robust speech recognition
UR - http://www.scopus.com/inward/record.url?scp=85068971679&partnerID=8YFLogxK
DO - 10.1109/ICASSP.2019.8682566
M3 - Conference contribution
AN - SCOPUS:85068971679
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 6570
EP - 6574
BT - 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 44th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019
Y2 - 12 May 2019 through 17 May 2019
ER -