Abstract
This paper presents a deep neural network (DNN) approach for head motion synthesis, which automatically predicts a speaker's head movement from his/her speech. Specifically, we realize speech-to-head-motion mapping by learning a DNN from audio-visual broadcast news data. We first show that a generatively pre-trained neural network significantly outperforms a conventional randomly initialized network. We then demonstrate that filter bank (FBank) features outperform mel-frequency cepstral coefficients (MFCC) and linear prediction coefficients (LPC) in head motion prediction. Finally, we find that extra training data from other speakers used in the pre-training stage can improve the head motion prediction performance for a target speaker. Our promising results in speech-to-head-motion prediction can be applied to talking avatar animation.
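To make the speech-to-head-motion mapping concrete, the sketch below shows a frame-wise feed-forward regressor from a context window of FBank frames to head rotation parameters. The layer sizes, context width, number of motion targets, and the use of random placeholder data are illustrative assumptions, not details taken from the paper; the generative pre-training step described in the abstract is also not reproduced here.

```python
# Minimal sketch: DNN regression from FBank feature windows to head motion,
# assuming a plain feed-forward network trained with mean squared error.
import torch
import torch.nn as nn

N_FBANK = 40    # FBank coefficients per frame (assumed)
CONTEXT = 11    # frames of acoustic context per input window (assumed)
N_MOTION = 3    # head rotation targets, e.g. pitch, yaw, roll (assumed)

model = nn.Sequential(
    nn.Linear(N_FBANK * CONTEXT, 512),
    nn.ReLU(),
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Linear(512, N_MOTION),
)

criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Placeholder data: in practice the inputs are FBank windows extracted from
# the audio track and the targets are tracked head-pose parameters from the
# accompanying video.
x = torch.randn(256, N_FBANK * CONTEXT)
y = torch.randn(256, N_MOTION)

for epoch in range(5):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: mse={loss.item():.4f}")
```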
Original language | English |
---|---|
Pages (from-to) | 9871-9888 |
Number of pages | 18 |
Journal | Multimedia Tools and Applications |
Volume | 74 |
Issue | 22 |
DOI | |
Publication status | Published - 24 Jul 2014 |