TY - JOUR
T1 - Realistic mouth-synching for speech-driven talking face using articulatory modelling
AU - Xie, Lei
AU - Liu, Zhi Qiang
PY - 2007/4
Y1 - 2007/4
N2 - This paper presents an articulatory modelling approach to convert acoustic speech into realistic mouth animation. We directly model the movements of articulators, such as lips, tongue, and teeth, using a dynamic Bayesian network (DBN)-based audio-visual articulatory model (AVAM). A multiple-stream structure with a shared articulator layer is adopted in the model to synchronously associate the two building blocks of speech, i.e., audio and video. This model not only describes the synchronization between visual articulatory movements and audio speech, but also reflects the linguistic fact that different articulators evolve asynchronously. We also present a Baum-Welch DBN inversion (DBNI) algorithm to generate optimal facial parameters from audio given the trained AVAM under the maximum likelihood (ML) criterion. Extensive objective and subjective evaluations on the JEWEL audio-visual dataset demonstrate that, compared with phonemic HMM approaches, facial parameters estimated by our approach follow the true parameters more accurately, and the synthesized facial animation sequences are so lively that 38% of them are indistinguishable.
KW - Articulatory model
KW - Baum-Welch DBN inversion (DBNI)
KW - Dynamic Bayesian networks (DBNs)
KW - Facial animation
KW - Mouth-synching
KW - Talking face
UR - http://www.scopus.com/inward/record.url?scp=33947583073&partnerID=8YFLogxK
U2 - 10.1109/TMM.2006.888009
DO - 10.1109/TMM.2006.888009
M3 - Article
AN - SCOPUS:33947583073
SN - 1520-9210
VL - 9
SP - 500
EP - 510
JO - IEEE Transactions on Multimedia
JF - IEEE Transactions on Multimedia
IS - 3
ER -