TY - GEN
T1 - Multi-granularity Semantic and Acoustic Stress Prediction for Expressive TTS
AU - Chi, Wenjiang
AU - Feng, Xiaoqin
AU - Xue, Liumeng
AU - Chen, Yunlin
AU - Xie, Lei
AU - Li, Zhifei
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - Stress, as the perceptual prominence within sentences, plays a key role in expressive text-to-speech (TTS). It can be either the semantic focus in text or the acoustic prominence in speech. However, stress labels are always annotated by listening to the speech, lacking semantic information in the corresponding text, which may degrade the accuracy of stress prediction and the expressivity of TTS. This paper proposes a multi-granularity stress prediction method for expressive TTS. Specifically, we first build Chinese Mandarin datasets with both coarse-grained semantic stress and fine-grained acoustic stress. Then, the proposed model progressively predicts semantic stress and acoustic stress. Finally, a TTS model is adopted to synthesize speech with the predicted stress. Experimental results on the proposed model and synthesized speech show that our proposed model achieves good accuracy in stress prediction and improves the expressiveness and naturalness of the synthesized speech.
AB - Stress, as the perceptual prominence within sentences, plays a key role in expressive text-to-speech (TTS). It can be either the semantic focus in text or the acoustic prominence in speech. However, stress labels are always annotated by listening to the speech, lacking semantic information in the corresponding text, which may degrade the accuracy of stress prediction and the expressivity of TTS. This paper proposes a multi-granularity stress prediction method for expressive TTS. Specifically, we first build Chinese Mandarin datasets with both coarse-grained semantic stress and fine-grained acoustic stress. Then, the proposed model progressively predicts semantic stress and acoustic stress. Finally, a TTS model is adopted to synthesize speech with the predicted stress. Experimental results on the proposed model and synthesized speech show that our proposed model achieves good accuracy in stress prediction and improves the expressiveness and naturalness of the synthesized speech.
UR - http://www.scopus.com/inward/record.url?scp=85180012939&partnerID=8YFLogxK
U2 - 10.1109/APSIPAASC58517.2023.10317138
DO - 10.1109/APSIPAASC58517.2023.10317138
M3 - Conference contribution
AN - SCOPUS:85180012939
T3 - 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2023
SP - 2409
EP - 2415
BT - 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2023
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2023
Y2 - 31 October 2023 through 3 November 2023
ER -