@inproceedings{f9791f86e3ff42f49e61e1818ab0064b,
title = "Time Domain Audio Visual Speech Separation",
abstract = "Audio-visual multi-modal modeling has been demonstrated to be effective in many speech related tasks, such as speech recognition and speech enhancement. This paper introduces a new time-domain audio-visual architecture for target speaker extraction from monaural mixtures. The architecture generalizes the previous TasNet (time-domain speech separation network) to enable multi-modal learning and at meanwhile it extends the classical audio-visual speech separation from frequency-domain to time-domain. The main components of proposed architecture include an audio encoder, a video encoder that extracts lip embedding from video streams, a multi-modal separation network and an audio decoder. Experiments on simulated mixtures based on recently released LRS2 dataset show that our method can bring 3dB+ and 4dB+ Si-SNR improvements on two-and three-speaker cases respectively, compared to audio-only TasNet and frequency-domain audio-visual networks.",
keywords = "TasNet, audio-visual speech separation, multi-modal learning, speech enhancement",
author = "Jian Wu and Yong Xu and Zhang, {Shi Xiong} and Chen, {Lian Wu} and Meng Yu and Lei Xie and Dong Yu",
note = "Publisher Copyright: {\textcopyright} 2019 IEEE.; 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 ; Conference date: 15-12-2019 Through 18-12-2019",
year = "2019",
month = dec,
doi = "10.1109/ASRU46091.2019.9003983",
language = "英语",
series = "2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
pages = "667--673",
booktitle = "2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings",
}
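
For reference, a minimal NumPy sketch (not from the paper) of the scale-invariant SNR (Si-SNR) metric cited in the abstract's results; the function name si_snr and the zero-mean convention are assumptions following common practice in the speech separation literature, and standard BibTeX tools ignore this text outside the entry.

import numpy as np

def si_snr(estimate: np.ndarray, reference: np.ndarray) -> float:
    """Scale-invariant signal-to-noise ratio (Si-SNR) in dB.

    Both signals are zero-meaned first, making the metric invariant
    to the scale and DC offset of the estimate (assumed convention).
    """
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to isolate the target component.
    s_target = (np.dot(estimate, reference) / np.dot(reference, reference)) * reference
    # Everything orthogonal to the reference counts as noise.
    e_noise = estimate - s_target
    return 10.0 * np.log10(np.dot(s_target, s_target) / np.dot(e_noise, e_noise))

The "3 dB+ and 4 dB+ Si-SNR improvements" in the abstract refer to gains in this quantity, averaged over the test mixtures, relative to the audio-only and frequency-domain baselines.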