Photo-real talking head with deep bidirectional LSTM

Bo Fan; Lijuan Wang; Frank K. Soong; Lei Xie

doi:10.1109/ICASSP.2015.7178899

School of Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

102 citations (Scopus)

Abstract

Long short-term memory (LSTM) is a recurrent neural network (RNN) architecture designed to model temporal sequences and their long-range dependencies more accurately than conventional RNNs. In this paper, we propose using a deep bidirectional LSTM (BLSTM) for audio/visual modeling in our photo-real talking head system. An audio/visual database of a subject talking is first recorded as training data. The audio/visual stereo data are converted into two parallel temporal sequences: contextual label sequences, obtained by force-aligning the audio against the text, and visual feature sequences, obtained by applying an active appearance model (AAM) to the lower face region of all training image samples. The deep BLSTM is then trained as a regression model by minimizing the sum of squared errors (SSE) in predicting the visual sequence from the label sequence. After testing different network topologies, we found that the best network on our datasets is two BLSTM layers sitting on top of one feed-forward layer. Compared with our previous HMM-based system, the newly proposed deep BLSTM-based system performs better in both objective measurement and subjective A/B testing.
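The topology named in the abstract (one feed-forward layer, two BLSTM layers on top of it, and an SSE regression objective) can be sketched as a forward pass in plain Python. This is only an illustrative sketch, not the authors' implementation: the layer sizes, the tanh feed-forward activation, the linear output layer, the omission of bias terms, the uniform random initialization, and every function and parameter name below are assumptions, and training (backpropagation through time) is left out.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    """Small uniform random weights (an assumed initialization)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def lstm_pass(W, seq, n_hid):
    """Run one LSTM direction over seq; W maps [x; h] to 4 stacked gates."""
    h, c, out = [0.0] * n_hid, [0.0] * n_hid, []
    for x in seq:
        z = matvec(W, x + h)  # list concatenation: input frame plus previous state
        i = [sigmoid(v) for v in z[:n_hid]]               # input gate
        f = [sigmoid(v) for v in z[n_hid:2 * n_hid]]      # forget gate
        o = [sigmoid(v) for v in z[2 * n_hid:3 * n_hid]]  # output gate
        g = [math.tanh(v) for v in z[3 * n_hid:]]         # candidate cell value
        c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
        out.append(h)
    return out


def blstm(W_fwd, W_bwd, seq, n_hid):
    """Concatenate forward-in-time and backward-in-time LSTM outputs per frame."""
    fwd = lstm_pass(W_fwd, seq, n_hid)
    bwd = lstm_pass(W_bwd, seq[::-1], n_hid)[::-1]
    return [a + b for a, b in zip(fwd, bwd)]


def init_params(n_in, n_ff, n_hid, n_out, seed=0):
    rng = random.Random(seed)
    return {
        "n_hid": n_hid,
        "ff": rand_matrix(n_ff, n_in, rng),
        "l1_fwd": rand_matrix(4 * n_hid, n_ff + n_hid, rng),
        "l1_bwd": rand_matrix(4 * n_hid, n_ff + n_hid, rng),
        "l2_fwd": rand_matrix(4 * n_hid, 3 * n_hid, rng),  # input is 2*n_hid wide
        "l2_bwd": rand_matrix(4 * n_hid, 3 * n_hid, rng),
        "out": rand_matrix(n_out, 2 * n_hid, rng),
    }


def predict(params, labels):
    """Label sequence -> feed-forward -> BLSTM -> BLSTM -> linear output."""
    n_hid = params["n_hid"]
    h = [[math.tanh(a) for a in matvec(params["ff"], x)] for x in labels]
    h = blstm(params["l1_fwd"], params["l1_bwd"], h, n_hid)
    h = blstm(params["l2_fwd"], params["l2_bwd"], h, n_hid)
    return [matvec(params["out"], x) for x in h]


def sse(pred, target):
    """Sum of squared errors over all frames and feature dimensions."""
    return sum((p - t) ** 2 for ps, ts in zip(pred, target) for p, t in zip(ps, ts))


# Toy data: 4 frames of one-hot "labels" and 2-dimensional "visual features".
labels = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
targets = [[0.0, 0.0] for _ in labels]
params = init_params(n_in=3, n_ff=4, n_hid=5, n_out=2)
loss = sse(predict(params, labels), targets)
```

Because the backward pass reads the sequence from the end, the prediction for the first frame depends on later frames as well, which is the property that distinguishes a bidirectional layer from a unidirectional one.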

Original language	English
Title of host publication	2015 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2015 - Proceedings
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	4884-4888
Number of pages	5
ISBN (Electronic)	9781467369978
DOI	https://doi.org/10.1109/ICASSP.2015.7178899
Publication status	Published - 4 Aug 2015
Event	40th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2015 - Brisbane, Australia. Duration: 19 Apr 2015 → 24 Apr 2015

Publication series

Name	ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Volume	2015-August
ISSN (Print)	1520-6149

Conference

Conference	40th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2015
Country/Territory	Australia
City	Brisbane
Period	19/04/15 → 24/04/15

Cite this

Fan, B., Wang, L., Soong, F. K., & Xie, L. (2015). Photo-real talking head with deep bidirectional LSTM. In 2015 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2015 - Proceedings (pp. 4884-4888). Article 7178899 (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings; Vol. 2015-August). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICASSP.2015.7178899

Fan, Bo ; Wang, Lijuan ; Soong, Frank K. et al. / Photo-real talking head with deep bidirectional LSTM. 2015 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2015 - Proceedings. Institute of Electrical and Electronics Engineers Inc., 2015. pp. 4884-4888 (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings).

@inproceedings{5d5683db52bf45d6876fc49c4899f4dd,

title = "Photo-real talking head with deep bidirectional LSTM",

abstract = "Long short-term memory (LSTM) is a specific recurrent neural network (RNN) architecture that is designed to model temporal sequences and their long-range dependencies more accurately than conventional RNNs. In this paper, we propose to use deep bidirectional LSTM (BLSTM) for audio/visual modeling in our photo-real talking head system. An audio/visual database of a subject's talking is firstly recorded as our training data. The audio/visual stereo data are converted into two parallel temporal sequences, i.e., contextual label sequences obtained by forced aligning audio against text, and visual feature sequences by applying active-appearance-model (AAM) on the lower face region among all the training image samples. The deep BLSTM is then trained to learn the regression model by minimizing the sum of square error (SSE) of predicting visual sequence from label sequence. After testing different network topologies, we interestingly found the best network is two BLSTM layers sitting on top of one feed-forward layer on our datasets. Compared with our previous HMM-based system, the newly proposed deep BLSTM-based one is better on both objective measurement and subjective A/B test.",

keywords = "AAM, BLSTM, RNN, talking head",

author = "Bo Fan and Lijuan Wang and Soong, {Frank K.} and Lei Xie",

note = "Publisher Copyright: {\textcopyright} 2015 IEEE.; 40th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2015 ; Conference date: 19-04-2015 Through 24-04-2015",

year = "2015",

month = aug,

day = "4",

doi = "10.1109/ICASSP.2015.7178899",

language = "English",

series = "ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

pages = "4884--4888",

booktitle = "2015 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2015 - Proceedings",

}

Fan, B, Wang, L, Soong, FK & Xie, L 2015, Photo-real talking head with deep bidirectional LSTM. in 2015 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2015 - Proceedings., 7178899, ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, vol. 2015-August, Institute of Electrical and Electronics Engineers Inc., pp. 4884-4888, 40th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2015, Brisbane, Australia, 19/04/15. https://doi.org/10.1109/ICASSP.2015.7178899

Photo-real talking head with deep bidirectional LSTM. / Fan, Bo; Wang, Lijuan; Soong, Frank K. et al.
2015 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2015 - Proceedings. Institute of Electrical and Electronics Engineers Inc., 2015. pp. 4884-4888 7178899 (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings; Vol. 2015-August).

TY - GEN

T1 - Photo-real talking head with deep bidirectional LSTM

AU - Fan, Bo

AU - Wang, Lijuan

AU - Soong, Frank K.

AU - Xie, Lei

PY - 2015/8/4

Y1 - 2015/8/4

N2 - Long short-term memory (LSTM) is a specific recurrent neural network (RNN) architecture that is designed to model temporal sequences and their long-range dependencies more accurately than conventional RNNs. In this paper, we propose to use deep bidirectional LSTM (BLSTM) for audio/visual modeling in our photo-real talking head system. An audio/visual database of a subject's talking is firstly recorded as our training data. The audio/visual stereo data are converted into two parallel temporal sequences, i.e., contextual label sequences obtained by forced aligning audio against text, and visual feature sequences by applying active-appearance-model (AAM) on the lower face region among all the training image samples. The deep BLSTM is then trained to learn the regression model by minimizing the sum of square error (SSE) of predicting visual sequence from label sequence. After testing different network topologies, we interestingly found the best network is two BLSTM layers sitting on top of one feed-forward layer on our datasets. Compared with our previous HMM-based system, the newly proposed deep BLSTM-based one is better on both objective measurement and subjective A/B test.

AB - Long short-term memory (LSTM) is a specific recurrent neural network (RNN) architecture that is designed to model temporal sequences and their long-range dependencies more accurately than conventional RNNs. In this paper, we propose to use deep bidirectional LSTM (BLSTM) for audio/visual modeling in our photo-real talking head system. An audio/visual database of a subject's talking is firstly recorded as our training data. The audio/visual stereo data are converted into two parallel temporal sequences, i.e., contextual label sequences obtained by forced aligning audio against text, and visual feature sequences by applying active-appearance-model (AAM) on the lower face region among all the training image samples. The deep BLSTM is then trained to learn the regression model by minimizing the sum of square error (SSE) of predicting visual sequence from label sequence. After testing different network topologies, we interestingly found the best network is two BLSTM layers sitting on top of one feed-forward layer on our datasets. Compared with our previous HMM-based system, the newly proposed deep BLSTM-based one is better on both objective measurement and subjective A/B test.

KW - AAM

KW - BLSTM

KW - RNN

KW - talking head

UR - http://www.scopus.com/inward/record.url?scp=84946029513&partnerID=8YFLogxK

U2 - 10.1109/ICASSP.2015.7178899

DO - 10.1109/ICASSP.2015.7178899

M3 - Conference contribution

AN - SCOPUS:84946029513

T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings

SP - 4884

EP - 4888

BT - 2015 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2015 - Proceedings

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - 40th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2015

Y2 - 19 April 2015 through 24 April 2015

ER -

Fan B, Wang L, Soong FK, Xie L. Photo-real talking head with deep bidirectional LSTM. In 2015 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2015 - Proceedings. Institute of Electrical and Electronics Engineers Inc. 2015. p. 4884-4888. 7178899. (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings). doi: 10.1109/ICASSP.2015.7178899
