Speech-driven video-realistic talking head synthesis using BLSTM-RNN

Shan Yang, Bo Fan, Lei Xie, Lijuan Wang, Geping Song

Research output: Contribution to journal › Article › peer-review

3 Citations (Scopus)

Abstract

This paper describes a deep bidirectional long short-term memory (BLSTM) approach to speech-driven photo-realistic talking head animation. Long short-term memory (LSTM) is a recurrent neural network (RNN) architecture designed to model temporal sequences and their long-range dependencies more accurately than conventional RNNs. The deep BLSTM-RNN model is trained on a speaker's audio-visual bimodal data. The active appearance model (AAM) is used to model facial movements, with the AAM parameters serving as the prediction targets of the neural network. The paper studies the impact of different network architectures and acoustic features. Tests on the LIPS2008 audio-visual corpus show that networks with BLSTM layer(s) consistently outperform those with only feed-forward layers. On this dataset, the best-performing network inserts a feed-forward layer between two BLSTM layers of 256 nodes each (BFB256). The combination of FBank, pitch and energy gives the best-performing feature set for the speech-driven talking head animation task.
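As a rough illustration of the BFB256 architecture described in the abstract, the sketch below stacks a feed-forward layer between two bidirectional LSTM layers of 256 units and regresses per-frame AAM parameters from acoustic feature sequences. This is a minimal PyTorch sketch under stated assumptions: the input feature dimension, the AAM parameter count, the activation, and the training setup are illustrative choices, not values from the paper; only the BLSTM-FF-BLSTM layout and the 256-unit width follow the text.

```python
import torch
import torch.nn as nn

class BFB256(nn.Module):
    """Sketch of a BFB256-style network: a feed-forward layer inserted
    between two bidirectional LSTM layers (256 units each), mapping
    acoustic features to AAM parameter trajectories.
    Dimensions below are illustrative assumptions, not paper values."""

    def __init__(self, in_dim=43, hidden=256, aam_dim=60):
        super().__init__()
        # First BLSTM layer over the acoustic feature sequence.
        self.blstm1 = nn.LSTM(in_dim, hidden, batch_first=True,
                              bidirectional=True)
        # Feed-forward layer between the two BLSTM layers.
        self.ff = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.Tanh())
        # Second BLSTM layer.
        self.blstm2 = nn.LSTM(hidden, hidden, batch_first=True,
                              bidirectional=True)
        # Linear regression output: one AAM parameter vector per frame.
        self.out = nn.Linear(2 * hidden, aam_dim)

    def forward(self, feats):
        # feats: (batch, frames, in_dim) acoustic features,
        # e.g. FBank + pitch + energy per frame.
        h, _ = self.blstm1(feats)
        h = self.ff(h)
        h, _ = self.blstm2(h)
        return self.out(h)  # (batch, frames, aam_dim) AAM trajectories

# Example: 2 utterances, 300 frames, 43-dim features (assumed layout).
model = BFB256()
aam = model(torch.randn(2, 300, 43))
```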

Original language: English
Pages (from-to): 250-256
Number of pages: 7
Journal: Qinghua Daxue Xuebao/Journal of Tsinghua University
Volume: 57
Issue number: 3
DOI
Publication status: Published - 1 Mar 2017
