HiStyle: Hierarchical Style Embedding Predictor for Text-Prompt-Guided Controllable Speech Synthesis

  • Ziyu Zhang
  • Hanzhao Li
  • Jingbin Hu
  • Wenhao Li
  • Lei Xie
  • Northwestern Polytechnical University Xian

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-reviewed

Abstract

Controllable speech synthesis refers to the precise control of speaking style by manipulating specific prosodic and paralinguistic attributes, such as gender, volume, speech rate, pitch, and pitch fluctuation. With the integration of advanced generative models, particularly large language models (LLMs) and diffusion models, controllable text-to-speech (TTS) systems have increasingly transitioned from label-based control to natural language description-based control, which is typically implemented by predicting global style embeddings from textual prompts. However, this straightforward prediction overlooks the underlying distribution of the style embeddings, which may hinder the full potential of controllable TTS systems. In this study, we use t-SNE analysis to visualize and analyze the global style embedding distribution of various mainstream TTS systems, revealing a clear hierarchical clustering pattern: embeddings first cluster by timbre and subsequently subdivide into finer clusters based on style attributes. Based on this observation, we propose HiStyle, a two-stage style embedding predictor that hierarchically predicts style embeddings conditioned on textual prompts, and further incorporate contrastive learning to help align the text and audio embedding spaces. Additionally, we propose a style annotation strategy that leverages the complementary strengths of statistical methodologies and human auditory preferences to generate more accurate and perceptually consistent textual prompts for style control. Comprehensive experiments demonstrate that when applied to the base TTS model, HiStyle achieves significantly better style controllability than alternative style embedding predicting approaches while preserving high speech quality in terms of naturalness and intelligibility. Audio samples are available at https://anonymous.4open.science/w/HiStyle-2517/.
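The abstract mentions contrastive learning to align the text and audio embedding spaces but does not state the loss. As a rough illustration only, the sketch below uses a symmetric InfoNCE objective over paired text and audio embeddings, which is one common choice and not necessarily what HiStyle uses. The function names `cosine` and `info_nce`, the temperature value, and the toy vectors are all assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(text_embs, audio_embs, temperature=0.1):
    """Symmetric InfoNCE loss; text_embs[i] is paired with audio_embs[i].

    Hypothetical helper for illustration, not the paper's implementation.
    """
    n = len(text_embs)
    # logits[i][j]: temperature-scaled similarity of text i and audio j
    logits = [[cosine(t, a) / temperature for a in audio_embs]
              for t in text_embs]

    def cross_entropy_diag(rows):
        # Mean cross-entropy where the correct class for row i is column i
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / n

    text_to_audio = cross_entropy_diag(logits)
    audio_to_text = cross_entropy_diag([list(col) for col in zip(*logits)])
    return 0.5 * (text_to_audio + audio_to_text)


# Toy batch: matched pairs should score a lower loss than mismatched pairs
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss is small when each prompt embedding points the same way as its paired audio embedding and large when the pairs are swapped, which is the behavior an alignment objective of this kind is meant to reward.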

Original language: English
Title of host publication: Man-Machine Speech Communication - 20th National Conference, NCMMSC 2025, Proceedings
Editors: Jia Jia, Zhiyong Wu, Lijian Gao, Gongping Huang, Ya Li
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 522-535
Number of pages: 14
ISBN (Print): 9789819553815
DOI
Publication status: Published - 2026
Event: 20th National Conference on Man-Machine Speech Communication, NCMMSC 2025 - Zhenjiang, China
Duration: 16 Oct 2025 to 19 Oct 2025

Publication series

Name: Communications in Computer and Information Science
Volume: 2662 CCIS
ISSN (Print): 1865-0929
ISSN (Electronic): 1865-0937

Conference

Conference: 20th National Conference on Man-Machine Speech Communication, NCMMSC 2025
Country/Territory: China
City: Zhenjiang
Period: 16/10/25 to 19/10/25
