Speech-driven video-realistic talking head synthesis using BLSTM-RNN

Shan Yang, Bo Fan, Lei Xie, Lijuan Wang, Geping Song

Research output: Contribution to journal › Article › peer-review

3 Citations (Scopus)

Abstract

This paper describes a deep bidirectional long short-term memory (BLSTM) approach to speech-driven photo-realistic talking head animation. Long short-term memory (LSTM) is a recurrent neural network (RNN) architecture designed to model temporal sequences and their long-range dependencies more accurately than conventional RNNs. The deep BLSTM-RNN model is trained on a speaker's audio-visual bimodal data. The active appearance model (AAM) is used to model facial movements, with the AAM parameters serving as the prediction targets of the neural network. The paper studies the impact of different network architectures and acoustic features. Tests on the LIPS2008 audio-visual corpus show that networks with BLSTM layer(s) consistently outperform those with only feed-forward layers. On this dataset, the best-performing network inserts a feed-forward layer between two BLSTM layers of 256 nodes each (BFB256). The combination of FBank, pitch and energy gives the best-performing feature set for the speech-driven talking head animation task.
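As a rough illustration of the BFB256 architecture described in the abstract, the sketch below stacks a feed-forward layer between two bidirectional LSTM layers of 256 units and regresses per-frame AAM parameters from acoustic feature sequences. This is a minimal PyTorch sketch under stated assumptions: the input feature dimension, the AAM parameter count, the activation, and the training setup are illustrative choices, not values from the paper; only the BLSTM-FF-BLSTM layout and the 256-unit width follow the text.

```python
import torch
import torch.nn as nn

class BFB256(nn.Module):
    """Sketch of a BFB256-style network: a feed-forward layer inserted
    between two bidirectional LSTM layers (256 units each), mapping
    acoustic features to AAM parameter trajectories.
    Dimensions below are illustrative assumptions, not paper values."""

    def __init__(self, in_dim=43, hidden=256, aam_dim=60):
        super().__init__()
        # First BLSTM layer over the acoustic feature sequence.
        self.blstm1 = nn.LSTM(in_dim, hidden, batch_first=True,
                              bidirectional=True)
        # Feed-forward layer between the two BLSTM layers.
        self.ff = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.Tanh())
        # Second BLSTM layer.
        self.blstm2 = nn.LSTM(hidden, hidden, batch_first=True,
                              bidirectional=True)
        # Linear regression output: one AAM parameter vector per frame.
        self.out = nn.Linear(2 * hidden, aam_dim)

    def forward(self, feats):
        # feats: (batch, frames, in_dim) acoustic features,
        # e.g. FBank + pitch + energy per frame.
        h, _ = self.blstm1(feats)
        h = self.ff(h)
        h, _ = self.blstm2(h)
        return self.out(h)  # (batch, frames, aam_dim) AAM trajectories

# Example: 2 utterances, 300 frames, 43-dim features (assumed layout).
model = BFB256()
aam = model(torch.randn(2, 300, 43))
```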

Original language: English
Pages (from-to): 250-256
Number of pages: 7
Journal: Qinghua Daxue Xuebao/Journal of Tsinghua University
Volume: 57
Issue number: 3
DOI
Publication status: Published - 1 Mar 2017
