TY - JOUR
T1 - Realistic mouth-synching for speech-driven talking face using articulatory modelling
AU - Xie, Lei
AU - Liu, Zhi Qiang
PY - 2007/4
Y1 - 2007/4
N2 - This paper presents an articulatory modelling approach to convert acoustic speech into realistic mouth animation. We directly model the movements of articulators, such as lips, tongue, and teeth, using a dynamic Bayesian network (DBN)-based audio-visual articulatory model (AVAM). A multiple-stream structure with a shared articulator layer is adopted in the model to synchronously associate the two building blocks of speech, i.e., audio and video. This model not only describes the synchronization between visual articulatory movements and audio speech, but also reflects the linguistic fact that different articulators evolve asynchronously. We also present a Baum-Welch DBN inversion (DBNI) algorithm to generate optimal facial parameters from audio given the trained AVAM under the maximum likelihood (ML) criterion. Extensive objective and subjective evaluations on the JEWEL audio-visual dataset demonstrate that, compared with phonemic HMM approaches, facial parameters estimated by our approach follow the true parameters more accurately, and the synthesized facial animation sequences are so lively that 38% of them are indistinguishable.
KW - Articulatory model
KW - Baum-Welch DBN inversion (DBNI)
KW - Dynamic Bayesian networks (DBNs)
KW - Facial animation
KW - Mouth-synching
KW - Talking face
UR - http://www.scopus.com/inward/record.url?scp=33947583073&partnerID=8YFLogxK
U2 - 10.1109/TMM.2006.888009
DO - 10.1109/TMM.2006.888009
M3 - Article
AN - SCOPUS:33947583073
SN - 1520-9210
VL - 9
SP - 500
EP - 510
JO - IEEE Transactions on Multimedia
JF - IEEE Transactions on Multimedia
IS - 3
ER -