TY - GEN
T1 - HiGNN-TTS
T2 - 2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023
AU - Guo, Dake
AU - Zhu, Xinfa
AU - Xue, Liumeng
AU - Li, Tao
AU - Lv, Yuanjun
AU - Jiang, Yuepeng
AU - Xie, Lei
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - Recent advances in text-to-speech, particularly those based on Graph Neural Networks (GNNs), have significantly improved the expressiveness of short-form synthetic speech. However, generating human-parity long-form speech with highly dynamic prosodic variations is still challenging. To address this problem, we expand the capabilities of GNNs with a hierarchical prosody modeling approach, named HiGNN-TTS. Specifically, we add a virtual global node in the graph to strengthen the interconnection of word nodes and introduce a contextual attention mechanism to broaden the prosody modeling scope of GNNs from intra-sentence to inter-sentence. Additionally, we perform hierarchical supervision from acoustic prosody on each node of the graph to capture prosodic variations with a high dynamic range. Ablation studies show the effectiveness of HiGNN-TTS in learning hierarchical prosody. Both objective and subjective evaluations demonstrate that HiGNN-TTS significantly improves the naturalness and expressiveness of long-form synthetic speech. Speech samples: https://dukguo.github.io/HiGNN-TTS/
AB - Recent advances in text-to-speech, particularly those based on Graph Neural Networks (GNNs), have significantly improved the expressiveness of short-form synthetic speech. However, generating human-parity long-form speech with highly dynamic prosodic variations is still challenging. To address this problem, we expand the capabilities of GNNs with a hierarchical prosody modeling approach, named HiGNN-TTS. Specifically, we add a virtual global node in the graph to strengthen the interconnection of word nodes and introduce a contextual attention mechanism to broaden the prosody modeling scope of GNNs from intra-sentence to inter-sentence. Additionally, we perform hierarchical supervision from acoustic prosody on each node of the graph to capture prosodic variations with a high dynamic range. Ablation studies show the effectiveness of HiGNN-TTS in learning hierarchical prosody. Both objective and subjective evaluations demonstrate that HiGNN-TTS significantly improves the naturalness and expressiveness of long-form synthetic speech. Speech samples: https://dukguo.github.io/HiGNN-TTS/
KW - Expressive long-form TTS
KW - graph neural network
KW - hierarchical prosody modeling
UR - http://www.scopus.com/inward/record.url?scp=85184666113&partnerID=8YFLogxK
U2 - 10.1109/ASRU57964.2023.10389629
DO - 10.1109/ASRU57964.2023.10389629
M3 - Conference contribution
AN - SCOPUS:85184666113
T3 - 2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023
BT - 2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 16 December 2023 through 20 December 2023
ER -