TY - GEN
T1 - Multi-granularity Semantic and Acoustic Stress Prediction for Expressive TTS
AU - Chi, Wenjiang
AU - Feng, Xiaoqin
AU - Xue, Liumeng
AU - Chen, Yunlin
AU - Xie, Lei
AU - Li, Zhifei
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - Stress, as the perceptual prominence within sentences, plays a key role in expressive text-to-speech (TTS). It can be either the semantic focus in text or the acoustic prominence in speech. However, stress labels are always annotated by listening to the speech, lacking semantic information in the corresponding text, which may degrade the accuracy of stress prediction and the expressivity of TTS. This paper proposes a multi-granularity stress prediction method for expressive TTS. Specifically, we first build Chinese Mandarin datasets with both coarse-grained semantic stress and fine-grained acoustic stress. Then, the proposed model progressively predicts semantic stress and acoustic stress. Finally, a TTS model is adopted to synthesize speech with the predicted stress. Experimental results on the proposed model and synthesized speech show that our proposed model achieves good accuracy in stress prediction and improves the expressiveness and naturalness of the synthesized speech.
AB - Stress, as the perceptual prominence within sentences, plays a key role in expressive text-to-speech (TTS). It can be either the semantic focus in text or the acoustic prominence in speech. However, stress labels are always annotated by listening to the speech, lacking semantic information in the corresponding text, which may degrade the accuracy of stress prediction and the expressivity of TTS. This paper proposes a multi-granularity stress prediction method for expressive TTS. Specifically, we first build Chinese Mandarin datasets with both coarse-grained semantic stress and fine-grained acoustic stress. Then, the proposed model progressively predicts semantic stress and acoustic stress. Finally, a TTS model is adopted to synthesize speech with the predicted stress. Experimental results on the proposed model and synthesized speech show that our proposed model achieves good accuracy in stress prediction and improves the expressiveness and naturalness of the synthesized speech.
UR - http://www.scopus.com/inward/record.url?scp=85180012939&partnerID=8YFLogxK
U2 - 10.1109/APSIPAASC58517.2023.10317138
DO - 10.1109/APSIPAASC58517.2023.10317138
M3 - Conference contribution
AN - SCOPUS:85180012939
T3 - 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2023
SP - 2409
EP - 2415
BT - 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2023
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2023
Y2 - 31 October 2023 through 3 November 2023
ER -