Abstract
This paper presents a deep neural network (DNN) approach for head motion synthesis, which automatically predicts a speaker's head movement from his/her speech. Specifically, we realize speech-to-head-motion mapping by learning a DNN from audio-visual broadcast news data. We first show that a generatively pre-trained neural network significantly outperforms a conventional randomly initialized network. We then demonstrate that filter bank (FBank) features outperform mel frequency cepstral coefficients (MFCC) and linear prediction coefficients (LPC) in head motion prediction. Finally, we discover that extra training data from other speakers, used in the pre-training stage, can improve the head motion prediction performance for a target speaker. These promising speech-to-head-motion prediction results can be applied to talking avatar animation.
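To make the mapping concrete, the sketch below shows one plausible form of such a speech-to-head-motion regressor: a feed-forward network that maps a context window of FBank frames to head pose parameters. The layer sizes, the 40-dimensional FBank frames, the 11-frame context window, and the Euler-angle pose targets are illustrative assumptions only, and the paper's generative pre-training stage is not reproduced here; this is not the authors' exact model.

```python
# Minimal sketch (PyTorch) of a speech-to-head-motion regressor:
# a feed-forward DNN maps a context window of FBank frames to head
# rotation parameters. All dimensions below are illustrative assumptions.
import torch
import torch.nn as nn

FBANK_DIM = 40        # assumed FBank coefficients per frame
CONTEXT_FRAMES = 11   # assumed acoustic context window
POSE_DIM = 3          # assumed head pose: pitch, yaw, roll

model = nn.Sequential(
    nn.Linear(FBANK_DIM * CONTEXT_FRAMES, 512),
    nn.Sigmoid(),
    nn.Linear(512, 512),
    nn.Sigmoid(),
    nn.Linear(512, POSE_DIM),   # linear output layer for regression
)

def train_step(fbank_window, head_pose, optimizer, loss_fn=nn.MSELoss()):
    """One gradient step on a batch of (FBank window, head pose) pairs."""
    optimizer.zero_grad()
    pred = model(fbank_window)
    loss = loss_fn(pred, head_pose)
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage with random stand-in data; real training would use
# aligned audio-visual features extracted from broadcast news recordings.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(32, FBANK_DIM * CONTEXT_FRAMES)
y = torch.randn(32, POSE_DIM)
print(train_step(x, y, optimizer))
```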
| Original language | English |
|---|---|
| Pages (from-to) | 9871-9888 |
| Number of pages | 18 |
| Journal | Multimedia Tools and Applications |
| Volume | 74 |
| Issue number | 22 |
| DOIs | |
| State | Published - 24 Jul 2014 |
Keywords
- Computer animation
- Deep neural network
- Head motion synthesis
- Talking avatar