A comparative study of audio features for audio-to-visual conversion in MPEG-4 compliant facial animation

Lei Xie; Zhi Qiang Liu

doi:10.1109/ICMLC.2006.259085

City University of Hong Kong

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Peer-reviewed

21 Citations (Scopus)

Abstract

Audio-to-visual conversion is the basic problem of speech-driven facial animation. Since the conversion problem is to predict facial control parameters from the acoustic speech, the informative representation of audio, i.e., the audio feature, is important to get a good prediction. This paper presents a performance comparison on prosodic features, articulatory features, and perceptual features for the audio-to-visual conversion problem on a common test bed. Experimental results show that the Mel frequency cepstral coefficients (MFCCs) produce the best performance, followed by the perceptual linear prediction coefficients (PLPC), the linear predictive cepstral coefficients (LPCCs), and the prosodic feature set (F₀ and energy). The combination of the three kinds of features can further improve the prediction performance on facial parameters. This shows that different audio features carry complementary information relevant to facial animation.
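As a rough illustration of two ideas the abstract mentions, the sketch below computes a per-frame log energy (one component of the prosodic feature set) and combines several feature streams by frame-wise concatenation. The abstract does not say how the authors frame the signal or combine features, so the function names, the frame and hop sizes, and the choice of concatenation are all assumptions made for this example.

```python
import math


def frame_log_energy(samples, frame_len, hop):
    """Log energy of each full frame of a 1-D signal (hypothetical helper)."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame)
        # A small floor avoids log(0) on silent frames.
        energies.append(math.log(energy + 1e-10))
    return energies


def concatenate_features(*streams):
    """Frame-wise concatenation of feature streams (assumed fusion scheme).

    Each stream is a list of per-frame feature vectors; the output is
    truncated to the shortest stream so every frame has all features.
    """
    n_frames = min(len(s) for s in streams)
    return [[v for s in streams for v in s[t]] for t in range(n_frames)]


# Usage: pair a toy 2-dimensional stream with the scalar log energy.
signal = [1.0] * 8
log_energy = [[e] for e in frame_log_energy(signal, frame_len=4, hop=2)]
toy_stream = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
combined = concatenate_features(toy_stream, log_energy)
```

Each row of `combined` then holds one frame's full feature vector, which is the form a frame-level predictor of facial parameters would typically consume.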

Original language	English
Title of host publication	Proceedings of the 2006 International Conference on Machine Learning and Cybernetics
Pages	4359-4364
Number of pages	6
DOI	https://doi.org/10.1109/ICMLC.2006.259085
Publication status	Published - 2006
Externally published	Yes
Event	2006 International Conference on Machine Learning and Cybernetics - Dalian, China; Duration: 13 Aug 2006 → 16 Aug 2006

Publication series

Name	Proceedings of the 2006 International Conference on Machine Learning and Cybernetics
Volume	2006

Conference

Conference	2006 International Conference on Machine Learning and Cybernetics
Country/Territory	China
City	Dalian
Period	13/08/06 → 16/08/06

Access to document

10.1109/ICMLC.2006.259085

Other files and links

Link to publication in Scopus

Cite this

Xie, L., & Liu, Z. Q. (2006). A comparative study of audio features for audio-to-visual conversion in MPEG-4 compliant facial animation. In Proceedings of the 2006 International Conference on Machine Learning and Cybernetics (pp. 4359-4364). Article 4028840 (Proceedings of the 2006 International Conference on Machine Learning and Cybernetics; Vol. 2006). https://doi.org/10.1109/ICMLC.2006.259085

@inproceedings{ef5c552aabb641398e0f9def71e4869c,

title = "A comparative study of audio features for audio-to-visual conversion in MPEG-4 compliant facial animation",

abstract = "Audio-to-visual conversion is the basic problem of speech-driven facial animation. Since the conversion problem is to predict facial control parameters from the acoustic speech, the informative representation of audio, i.e., the audio feature, is important to get a good prediction. This paper presents a performance comparison on prosodic features, articulatory features, and perceptual features for the audio-to-visual conversion problem on a common test bed. Experimental results show that the Mel frequency cepstral coefficients (MFCCs) produce the best performance, followed by the perceptual linear prediction coefficients (PLPC), the linear predictive cepstral coefficients (LPCCs), and the prosodic feature set (F0 and energy). The combination of the three kinds of features can further improve the prediction performance on facial parameters. This shows that different audio features carry complementary information relevant to facial animation.",

keywords = "Audio features, Audio-to-visual conversion, Facial animation, MPEG-4, Talking face",

author = "Xie, Lei and Liu, {Zhi Qiang}",

year = "2006",

doi = "10.1109/ICMLC.2006.259085",

language = "English",

isbn = "1424400619",

series = "Proceedings of the 2006 International Conference on Machine Learning and Cybernetics",

pages = "4359--4364",

booktitle = "Proceedings of the 2006 International Conference on Machine Learning and Cybernetics",

note = "2006 International Conference on Machine Learning and Cybernetics ; Conference date: 13-08-2006 Through 16-08-2006",

}

Xie, L & Liu, ZQ 2006, A comparative study of audio features for audio-to-visual conversion in MPEG-4 compliant facial animation. in Proceedings of the 2006 International Conference on Machine Learning and Cybernetics, 4028840, Proceedings of the 2006 International Conference on Machine Learning and Cybernetics, vol. 2006, pp. 4359-4364, 2006 International Conference on Machine Learning and Cybernetics, Dalian, China, 13/08/06. https://doi.org/10.1109/ICMLC.2006.259085

A comparative study of audio features for audio-to-visual conversion in MPEG-4 compliant facial animation. / Xie, Lei; Liu, Zhi Qiang.
Proceedings of the 2006 International Conference on Machine Learning and Cybernetics. 2006. p. 4359-4364 4028840 (Proceedings of the 2006 International Conference on Machine Learning and Cybernetics; Vol. 2006).


TY - GEN

T1 - A comparative study of audio features for audio-to-visual conversion in MPEG-4 compliant facial animation

AU - Xie, Lei

AU - Liu, Zhi Qiang

PY - 2006

Y1 - 2006

N2 - Audio-to-visual conversion is the basic problem of speech-driven facial animation. Since the conversion problem is to predict facial control parameters from the acoustic speech, the informative representation of audio, i.e., the audio feature, is important to get a good prediction. This paper presents a performance comparison on prosodic features, articulatory features, and perceptual features for the audio-to-visual conversion problem on a common test bed. Experimental results show that the Mel frequency cepstral coefficients (MFCCs) produce the best performance, followed by the perceptual linear prediction coefficients (PLPC), the linear predictive cepstral coefficients (LPCCs), and the prosodic feature set (F0 and energy). The combination of the three kinds of features can further improve the prediction performance on facial parameters. This shows that different audio features carry complementary information relevant to facial animation.

AB - Audio-to-visual conversion is the basic problem of speech-driven facial animation. Since the conversion problem is to predict facial control parameters from the acoustic speech, the informative representation of audio, i.e., the audio feature, is important to get a good prediction. This paper presents a performance comparison on prosodic features, articulatory features, and perceptual features for the audio-to-visual conversion problem on a common test bed. Experimental results show that the Mel frequency cepstral coefficients (MFCCs) produce the best performance, followed by the perceptual linear prediction coefficients (PLPC), the linear predictive cepstral coefficients (LPCCs), and the prosodic feature set (F0 and energy). The combination of the three kinds of features can further improve the prediction performance on facial parameters. This shows that different audio features carry complementary information relevant to facial animation.

KW - Audio features

KW - Audio-to-visual conversion

KW - Facial animation

KW - MPEG-4

KW - Talking face

UR - http://www.scopus.com/inward/record.url?scp=33947224137&partnerID=8YFLogxK

U2 - 10.1109/ICMLC.2006.259085

DO - 10.1109/ICMLC.2006.259085

M3 - Conference contribution

AN - SCOPUS:33947224137

SN - 1424400619

SN - 9781424400614

T3 - Proceedings of the 2006 International Conference on Machine Learning and Cybernetics

SP - 4359

EP - 4364

BT - Proceedings of the 2006 International Conference on Machine Learning and Cybernetics

T2 - 2006 International Conference on Machine Learning and Cybernetics

Y2 - 13 August 2006 through 16 August 2006

ER -
