Abstract
Integrating reinforcement learning to align generated speech with human preferences has proven effective in improving the robustness of modern text-to-speech (TTS) systems. Current approaches rely primarily on preference data annotated at the utterance level. However, issues that degrade the listening experience often arise only in specific segments of an audio sample, while other segments are well generated and require no correction. This mismatch between coarse-grained supervision and fine-grained quality variation limits the effectiveness of preference-based optimization. In this study, we propose a fine-grained preference optimization approach (FPO) to enhance the robustness of TTS systems. FPO shifts the optimization paradigm from global utterance-level tuning to targeted local refinement, addressing localized issues in generated samples rather than uniformly optimizing the entire utterance. We begin by analyzing common generation issues and categorizing them into temporal modeling errors and semantic-phonetic alignment errors, both of which frequently degrade intelligibility and naturalness. To tackle these problems, we introduce a selective training loss strategy that leverages fine-grained labels for each issue type, allowing the model to focus learning signals where they are most needed. Experimental results demonstrate that FPO substantially improves the robustness of zero-shot TTS systems by effectively correcting problematic regions in the output, yielding a significant reduction in the bad-case ratio alongside improved intelligibility and overall perceptual quality. Moreover, FPO exhibits strong data efficiency, achieving comparable or superior performance to baseline methods while requiring notably fewer training samples.
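The selective training loss described in the abstract might be sketched as follows. This is a minimal illustration only: the function name, the per-token log-probability inputs, and the DPO-style Bradley-Terry objective are assumptions for the sake of the sketch, not the paper's actual formulation. The key idea it mirrors is that a fine-grained segment mask restricts the preference loss to flagged problem regions, so well-generated regions receive no training signal.

```python
import math

def fine_grained_preference_loss(logp_win, logp_lose, mask, beta=0.1):
    """Hypothetical sketch of a segment-masked, DPO-style preference loss.

    logp_win / logp_lose: per-token log-probability ratios (policy minus
    reference) for the preferred and dispreferred token sequences, as
    lists of lists with shape (batch, tokens). mask: 1.0 where a token
    lies inside a flagged problem segment (e.g. a temporal modeling or
    semantic-phonetic alignment error), 0.0 elsewhere. All names and the
    exact objective are illustrative assumptions.
    """
    batch_losses = []
    for win, lose, m in zip(logp_win, logp_lose, mask):
        # Preference margin accumulated over flagged tokens only;
        # unmasked (well-generated) tokens contribute nothing.
        margin = sum((w - l) * mk for w, l, mk in zip(win, lose, m))
        # Bradley-Terry negative log-likelihood: -log sigmoid(beta * margin)
        batch_losses.append(math.log1p(math.exp(-beta * margin)))
    return sum(batch_losses) / len(batch_losses)
```

With an all-zero mask the margin is zero, the loss is a constant log 2, and no gradient flows through the model's log-probabilities, which matches the intended behavior for utterances without flagged segments.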
| Original language | English |
|---|---|
| Pages (from-to) | 557-566 |
| Number of pages | 10 |
| Journal | IEEE Transactions on Audio, Speech and Language Processing |
| Volume | 34 |
| DOIs | |
| State | Published - 2026 |
Keywords
- Preference optimization
- speech synthesis
- text-to-speech
FPO: Fine-Grained Preference Optimization Improves Zero-Shot Text-to-Speech