Abstract
This paper describes a deep bidirectional long short-term memory (BLSTM) approach for speech-driven, photo-realistic talking head animation. Long short-term memory (LSTM) is a recurrent neural network (RNN) architecture designed to model temporal sequences and their long-range dependencies more accurately than conventional RNNs. The deep BLSTM-RNN model is trained on a speaker's audio-visual bimodal data. An active appearance model (AAM) is used to model the facial movements, and the AAM parameters serve as the prediction targets of the neural network. The paper studies the impact of different network architectures and acoustic features. Tests on the LIPS2008 audio-visual corpus show that networks with one or more BLSTM layers consistently outperform those having only feed-forward layers. On this dataset, the best network places a feed-forward layer between two BLSTM layers, with 256 nodes (BFB256). The combination of filter bank (FBank), pitch and energy features is the best-performing feature set for the speech-driven talking head animation task.
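The BLSTM, feed-forward, BLSTM layer order described in the abstract can be illustrated with a minimal forward-pass sketch. It is not the authors' implementation. It uses only the Python standard library and random, untrained weights. The names `build_bfb` and `predict`, the tanh activation in the feed-forward layer, the linear output layer, and the choice to give each LSTM direction `n_nodes` units are all assumptions made for illustration; the abstract does not specify them.

```python
import math
import random


def _matrix(rows, cols, rng):
    """Small random weight matrix (rows x cols) stored as nested lists."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def _affine(W, x, b):
    """Compute W @ x + b for a nested-list matrix and a list vector."""
    return [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]


def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class LSTM:
    """Unidirectional LSTM layer without peepholes (forward pass only)."""

    def __init__(self, n_in, n_hidden, rng):
        self.n_hidden = n_hidden
        # One weight matrix per gate (input, forget, output, candidate),
        # each acting on the concatenation [x_t ; h_{t-1}].
        self.W = {g: _matrix(n_hidden, n_in + n_hidden, rng) for g in "ifoc"}
        self.b = {g: [0.0] * n_hidden for g in "ifoc"}

    def run(self, frames):
        h = [0.0] * self.n_hidden
        c = [0.0] * self.n_hidden
        outputs = []
        for x in frames:
            z = list(x) + h
            i = [_sigmoid(v) for v in _affine(self.W["i"], z, self.b["i"])]
            f = [_sigmoid(v) for v in _affine(self.W["f"], z, self.b["f"])]
            o = [_sigmoid(v) for v in _affine(self.W["o"], z, self.b["o"])]
            g = [math.tanh(v) for v in _affine(self.W["c"], z, self.b["c"])]
            c = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
            h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
            outputs.append(h)
        return outputs


class BLSTM:
    """Runs one LSTM forward in time and one backward, then concatenates."""

    def __init__(self, n_in, n_hidden, rng):
        self.fwd = LSTM(n_in, n_hidden, rng)
        self.bwd = LSTM(n_in, n_hidden, rng)

    def run(self, frames):
        forward = self.fwd.run(frames)
        backward = self.bwd.run(frames[::-1])[::-1]  # re-align to time order
        return [hf + hb for hf, hb in zip(forward, backward)]


class Dense:
    """Frame-wise feed-forward layer with an optional activation."""

    def __init__(self, n_in, n_out, rng, activation=None):
        self.W = _matrix(n_out, n_in, rng)
        self.b = [0.0] * n_out
        self.activation = activation

    def run(self, frames):
        out = []
        for x in frames:
            y = _affine(self.W, x, self.b)
            out.append([self.activation(v) for v in y] if self.activation else y)
        return out


def build_bfb(n_acoustic, n_aam, n_nodes=256, seed=0):
    """Hypothetical BFB stack: BLSTM -> feed-forward -> BLSTM -> linear output."""
    rng = random.Random(seed)
    return [
        BLSTM(n_acoustic, n_nodes, rng),
        Dense(2 * n_nodes, n_nodes, rng, activation=math.tanh),  # assumed tanh
        BLSTM(n_nodes, n_nodes, rng),
        Dense(2 * n_nodes, n_aam, rng),  # assumed linear regression to AAM parameters
    ]


def predict(layers, frames):
    """Map a sequence of acoustic feature frames to AAM parameter frames."""
    for layer in layers:
        frames = layer.run(frames)
    return frames
```

Calling `predict(build_bfb(n_acoustic=5, n_aam=3, n_nodes=4), frames)` on a list of 5-dimensional frames returns one 3-dimensional vector per input frame; the small sizes here are placeholders, not the feature or AAM dimensions used in the paper. Because each BLSTM also reads the sequence in reverse, every output frame depends on both past and future acoustic context, which a purely feed-forward stack cannot provide.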
Field | Value |
---|---|
Original language | English |
Pages (from-to) | 250-256 |
Number of pages | 7 |
Journal | Qinghua Daxue Xuebao/Journal of Tsinghua University |
Volume | 57 |
Issue number | 3 |
State | Published - 1 Mar 2017 |
Keywords
- Active appearance model (AAM)
- Bidirectional long short-term memory (BLSTM)
- Facial animation
- Recurrent neural network (RNN)
- Talking avatar