TY - JOUR
T1 - A deep bidirectional LSTM approach for video-realistic talking head
AU - Fan, Bo
AU - Xie, Lei
AU - Yang, Shan
AU - Wang, Lijuan
AU - Soong, Frank K.
N1 - Publisher Copyright:
© 2015, Springer Science+Business Media New York.
PY - 2016/5/1
Y1 - 2016/5/1
N2 - This paper proposes a deep bidirectional long short-term memory (DBLSTM) approach to modeling the long-range contextual, nonlinear mapping between audio and visual streams for a video-realistic talking head. In the training stage, an audio-visual stereo database is first recorded of a subject talking to a camera. The audio streams are converted into acoustic features, i.e., Mel-Frequency Cepstral Coefficients (MFCCs), and their textual labels are also extracted. The visual streams, in particular the lower face region, are compactly represented by active appearance model (AAM) parameters, with which shape and texture variations can be jointly modeled. Given pairs of audio and visual parameter sequences, a DBLSTM model is trained to learn the sequence mapping from the audio to the visual space. For any unseen speech audio, whether originally recorded or synthesized by text-to-speech (TTS), the trained DBLSTM model can predict a convincing AAM parameter trajectory for lower-face animation. To further improve the realism of the proposed talking head, the trajectory tiling method is adopted, using the DBLSTM-predicted AAM trajectory as a guide to select a smooth sequence of real sample images from the recorded database. We then stitch the selected lower-face image sequence back onto a background face video of the same subject, resulting in a video-realistic talking head. Experimental results show that the proposed DBLSTM approach outperforms the existing HMM-based approach in both objective and subjective evaluations.
AB - This paper proposes a deep bidirectional long short-term memory (DBLSTM) approach to modeling the long-range contextual, nonlinear mapping between audio and visual streams for a video-realistic talking head. In the training stage, an audio-visual stereo database is first recorded of a subject talking to a camera. The audio streams are converted into acoustic features, i.e., Mel-Frequency Cepstral Coefficients (MFCCs), and their textual labels are also extracted. The visual streams, in particular the lower face region, are compactly represented by active appearance model (AAM) parameters, with which shape and texture variations can be jointly modeled. Given pairs of audio and visual parameter sequences, a DBLSTM model is trained to learn the sequence mapping from the audio to the visual space. For any unseen speech audio, whether originally recorded or synthesized by text-to-speech (TTS), the trained DBLSTM model can predict a convincing AAM parameter trajectory for lower-face animation. To further improve the realism of the proposed talking head, the trajectory tiling method is adopted, using the DBLSTM-predicted AAM trajectory as a guide to select a smooth sequence of real sample images from the recorded database. We then stitch the selected lower-face image sequence back onto a background face video of the same subject, resulting in a video-realistic talking head. Experimental results show that the proposed DBLSTM approach outperforms the existing HMM-based approach in both objective and subjective evaluations.
KW - Active appearance model
KW - Long short-term memory
KW - Recurrent neural network
KW - Talking head
KW - Visual speech synthesis
UR - http://www.scopus.com/inward/record.url?scp=84944699627&partnerID=8YFLogxK
U2 - 10.1007/s11042-015-2944-3
DO - 10.1007/s11042-015-2944-3
M3 - Article
AN - SCOPUS:84944699627
SN - 1380-7501
VL - 75
SP - 5287
EP - 5309
JO - Multimedia Tools and Applications
JF - Multimedia Tools and Applications
IS - 9
ER -