Abstract
Head motion naturally occurs in synchrony with speech and conveys important cues about intention, attitude and emotion. This paper aims to synthesize head motions from natural speech for talking avatar applications. Specifically, we study the feasibility of learning speech-to-head-motion regression models with two popular types of neural networks, i.e., feed-forward and bidirectional long short-term memory (BLSTM). We find that BLSTM networks clearly outperform feed-forward ones on this task because of their capacity to learn long-range speech dynamics. More interestingly, we observe that stacking different network types, i.e., inserting a feed-forward layer between two BLSTM layers, achieves the best performance. Subjective evaluation shows that this hybrid network can produce more plausible head motions from speech.
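The hybrid architecture described in the abstract can be sketched as follows. This is a hypothetical illustration, not the authors' code: layer sizes, the 39-dimensional speech input (e.g., MFCCs with deltas) and the 3-dimensional head-motion output (e.g., Euler angles) are all assumptions for the example.

```python
# Hypothetical sketch of the hybrid BLSTM-FF-BLSTM regression network:
# a feed-forward layer inserted between two BLSTM layers, mapping
# per-frame speech features to per-frame head-motion parameters.
# All dimensions below are illustrative assumptions.
import torch
import torch.nn as nn

class HybridBLSTM(nn.Module):
    def __init__(self, in_dim=39, hidden=128, ff_dim=256, out_dim=3):
        super().__init__()
        # First BLSTM layer: models long-range speech dynamics in both directions.
        self.blstm1 = nn.LSTM(in_dim, hidden, batch_first=True, bidirectional=True)
        # Feed-forward layer inserted between the two BLSTM layers.
        self.ff = nn.Sequential(nn.Linear(2 * hidden, ff_dim), nn.Tanh())
        # Second BLSTM layer.
        self.blstm2 = nn.LSTM(ff_dim, hidden, batch_first=True, bidirectional=True)
        # Linear output layer: regresses head-motion parameters per frame.
        self.out = nn.Linear(2 * hidden, out_dim)

    def forward(self, x):          # x: (batch, frames, in_dim)
        h, _ = self.blstm1(x)      # (batch, frames, 2*hidden)
        h = self.ff(h)             # (batch, frames, ff_dim)
        h, _ = self.blstm2(h)      # (batch, frames, 2*hidden)
        return self.out(h)         # (batch, frames, out_dim)

model = HybridBLSTM()
speech = torch.randn(2, 100, 39)   # 2 utterances, 100 frames, 39-dim features
motion = model(speech)             # per-frame head-motion trajectory
```

In a training setup, such a model would typically be fit with a frame-wise mean-squared-error loss between predicted and motion-captured head-motion trajectories.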
| Original language | English |
|---|---|
| Pages (from-to) | 3345-3349 |
| Number of pages | 5 |
| Journal | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH |
| Volume | 2015-January |
| DOI | |
| Publication status | Published - 2015 |
| Event | 16th Annual Conference of the International Speech Communication Association, INTERSPEECH 2015 - Dresden, Germany. Duration: 6 Sep 2015 → 10 Sep 2015 |
Fingerprint

Explore the research topics of 'BLSTM neural networks for speech driven head motion synthesis'. Together they form a unique fingerprint.