TY - JOUR
T1 - Speech-driven head motion synthesis using neural networks
AU - Ding, Chuang
AU - Zhu, Pengcheng
AU - Xie, Lei
AU - Jiang, Dongmei
AU - Fu, Zhonghua
N1 - Publisher Copyright:
Copyright © 2014 ISCA.
PY - 2014
Y1 - 2014
AB - This paper presents a neural network approach for speech-driven head motion synthesis, which automatically predicts a speaker's head movement from his/her speech. Specifically, we realize the speech-to-head-motion mapping by learning a multi-layer perceptron from audio-visual broadcast news data. First, we show that a generatively pre-trained neural network significantly outperforms both a randomly initialized network and the hidden Markov model (HMM) approach. Second, we demonstrate that the feature combination of log Mel-scale filter-bank (FBank), energy and fundamental frequency (F0) performs best in head motion prediction. Third, we discover that using long-context acoustic information further improves performance. Finally, adding extra unlabeled training data in the pre-training stage yields further performance gains. The proposed speech-driven head motion synthesis approach increases the canonical correlation analysis (CCA) score from 0.299 (the HMM approach) to 0.565, and it can be effectively used in expressive talking avatar animation.
KW - Deep neural network
KW - Head motion synthesis
KW - Neural network
KW - Talking avatar
UR - http://www.scopus.com/inward/record.url?scp=84910030988&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:84910030988
SN - 2308-457X
SP - 2303
EP - 2307
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
T2 - 15th Annual Conference of the International Speech Communication Association: Celebrating the Diversity of Spoken Languages, INTERSPEECH 2014
Y2 - 14 September 2014 through 18 September 2014
ER -