TY - GEN
T1 - A Kullback-Leibler divergence based recurrent mixture density network for acoustic modeling in emotional statistical parametric speech synthesis
AU - An, Xiaochun
AU - Zhang, Yuchao
AU - Liu, Bing
AU - Xue, Liumeng
AU - Xie, Lei
N1 - Publisher Copyright:
© 2018 Association for Computing Machinery.
PY - 2018/10/19
Y1 - 2018/10/19
N2 - This paper proposes a Kullback-Leibler divergence (KLD) based recurrent mixture density network (RMDN) approach for acoustic modeling in emotional statistical parametric speech synthesis (SPSS), which aims at improving model accuracy and emotion naturalness. First, to improve model accuracy, we propose to use an RMDN as the acoustic model, which combines an LSTM with a mixture density network (MDN). Adding a mixture density layer allows us to perform multimodal regression as well as predict variances, thus modeling more accurate probability density functions of acoustic features. Second, we further introduce Kullback-Leibler divergence regularization in model training. Inspired by KLD's success in acoustic model adaptation, we aim to improve emotion naturalness by maximizing the distances between the distributions of emotional speech and neutral speech. Objective and subjective evaluations show that the proposed approach improves the prediction accuracy of acoustic features and the naturalness of the synthesized emotional speech.
KW - Emotional statistical parametric speech synthesis
KW - KLD-RMDN
KW - LSTM
KW - Recurrent mixture density network
UR - http://www.scopus.com/inward/record.url?scp=85061697082&partnerID=8YFLogxK
U2 - 10.1145/3267935.3267949
DO - 10.1145/3267935.3267949
M3 - Conference contribution
AN - SCOPUS:85061697082
T3 - ASMMC-MMAC 2018 - Proceedings of the Joint Workshop of the 4th Workshop on Affective Social Multimedia Computing and 1st Multi-Modal Affective Computing of Large-Scale Multimedia Data, Co-located with MM 2018
SP - 1
EP - 6
BT - ASMMC-MMAC 2018 - Proceedings of the Joint Workshop of the 4th Workshop on Affective Social Multimedia Computing and 1st Multi-Modal Affective Computing of Large-Scale Multimedia Data, Co-located with MM 2018
PB - Association for Computing Machinery, Inc
T2 - Joint Workshop of the 4th Workshop on Affective Social Multimedia Computing and 1st Multi-Modal Affective Computing of Large-Scale Multimedia Data Workshop, ASMMC-MMAC 2018
Y2 - 26 October 2018
ER -