A coupled HMM approach to video-realistic speech animation

Lei Xie; Zhi Qiang Liu

doi:10.1016/j.patcog.2006.12.001

City University of Hong Kong

Research output: Contribution to journal › Article › peer-review

67 Scopus citations

Abstract

We propose a coupled hidden Markov model (CHMM) approach to video-realistic speech animation, which realizes realistic facial animations driven by speaker independent continuous speech. Different from hidden Markov model (HMM)-based animation approaches that use a single-state chain, we use CHMMs to explicitly model the subtle characteristics of audio-visual speech, e.g., the asynchrony, temporal dependency (synchrony), and different speech classes between the two modalities. We derive an expectation maximization (EM)-based A/V conversion algorithm for the CHMMs, which converts acoustic speech into decent facial animation parameters. We also present a video-realistic speech animation system. The system transforms the facial animation parameters to a mouth animation sequence, refines the animation with a performance refinement process, and finally stitches the animated mouth with a background facial sequence seamlessly. We have compared the animation performance of the CHMM with the HMMs, the multi-stream HMMs and the factorial HMMs both objectively and subjectively. Results show that the CHMMs achieve superior animation performance. The ph-vi-CHMM system, which adopts different state variables (phoneme states and viseme states) in the audio and visual modalities, performs the best. The proposed approach indicates that explicitly modelling audio-visual speech is promising for speech animation.

Original language	English
Pages (from-to)	2325-2340
Number of pages	16
Journal	Pattern Recognition
Volume	40
Issue number	8
DOIs	https://doi.org/10.1016/j.patcog.2006.12.001
State	Published - Aug 2007
Externally published	Yes

Keywords

Audio-to-visual conversion
Coupled hidden Markov models (CHMMs)
Facial animation
Speech animation
Talking faces

Cite this

@article{c0bc5874333e47f393447e7916bf885b,
title = "A coupled HMM approach to video-realistic speech animation",
abstract = "We propose a coupled hidden Markov model (CHMM) approach to video-realistic speech animation, which realizes realistic facial animations driven by speaker independent continuous speech. Different from hidden Markov model (HMM)-based animation approaches that use a single-state chain, we use CHMMs to explicitly model the subtle characteristics of audio-visual speech, e.g., the asynchrony, temporal dependency (synchrony), and different speech classes between the two modalities. We derive an expectation maximization (EM)-based A/V conversion algorithm for the CHMMs, which converts acoustic speech into decent facial animation parameters. We also present a video-realistic speech animation system. The system transforms the facial animation parameters to a mouth animation sequence, refines the animation with a performance refinement process, and finally stitches the animated mouth with a background facial sequence seamlessly. We have compared the animation performance of the CHMM with the HMMs, the multi-stream HMMs and the factorial HMMs both objectively and subjectively. Results show that the CHMMs achieve superior animation performance. The ph-vi-CHMM system, which adopts different state variables (phoneme states and viseme states) in the audio and visual modalities, performs the best. The proposed approach indicates that explicitly modelling audio-visual speech is promising for speech animation.",

keywords = "Audio-to-visual conversion, Coupled hidden Markov models (CHMMs), Facial animation, Speech animation, Talking faces",
author = "Lei Xie and Liu, {Zhi Qiang}",
year = "2007",
month = aug,
doi = "10.1016/j.patcog.2006.12.001",
language = "English",
volume = "40",
pages = "2325--2340",
journal = "Pattern Recognition",
issn = "0031-3203",
publisher = "Elsevier Ltd",
number = "8",
}

TY - JOUR
T1 - A coupled HMM approach to video-realistic speech animation
AU - Xie, Lei
AU - Liu, Zhi Qiang
PY - 2007/8
Y1 - 2007/8

AB - We propose a coupled hidden Markov model (CHMM) approach to video-realistic speech animation, which realizes realistic facial animations driven by speaker independent continuous speech. Different from hidden Markov model (HMM)-based animation approaches that use a single-state chain, we use CHMMs to explicitly model the subtle characteristics of audio-visual speech, e.g., the asynchrony, temporal dependency (synchrony), and different speech classes between the two modalities. We derive an expectation maximization (EM)-based A/V conversion algorithm for the CHMMs, which converts acoustic speech into decent facial animation parameters. We also present a video-realistic speech animation system. The system transforms the facial animation parameters to a mouth animation sequence, refines the animation with a performance refinement process, and finally stitches the animated mouth with a background facial sequence seamlessly. We have compared the animation performance of the CHMM with the HMMs, the multi-stream HMMs and the factorial HMMs both objectively and subjectively. Results show that the CHMMs achieve superior animation performance. The ph-vi-CHMM system, which adopts different state variables (phoneme states and viseme states) in the audio and visual modalities, performs the best. The proposed approach indicates that explicitly modelling audio-visual speech is promising for speech animation.

KW - Audio-to-visual conversion
KW - Coupled hidden Markov models (CHMMs)
KW - Facial animation
KW - Speech animation
KW - Talking faces
UR - http://www.scopus.com/inward/record.url?scp=34147186624&partnerID=8YFLogxK
U2 - 10.1016/j.patcog.2006.12.001
DO - 10.1016/j.patcog.2006.12.001
M3 - Article
AN - SCOPUS:34147186624
SN - 0031-3203
VL - 40
SP - 2325
EP - 2340
JO - Pattern Recognition
JF - Pattern Recognition
IS - 8
ER -
