
FPO: Fine-Grained Preference Optimization Improves Zero-Shot Text-to-Speech

  • Jixun Yao
  • Yang Yuguang
  • Yuan Feng
  • Yu Pan
  • Ziqian Ning
  • Jianhao Ye
  • Hongbin Zhou
  • Lei Xie
  • Northwestern Polytechnical University, Xi'an
  • The University of Hong Kong
  • Ltd

Research output: Contribution to journal › Article › peer-review

Abstract

Integrating reinforcement learning to align generated speech with human preferences has proven effective in improving the robustness of modern text-to-speech (TTS) systems. Current approaches primarily rely on preference data annotated at the utterance level. However, issues that degrade the listening experience often arise only in specific segments of the audio, while the remaining segments are well generated and require no correction. This mismatch between coarse-grained supervision and fine-grained quality variation limits the effectiveness of preference-based optimization. In this study, we propose a fine-grained preference optimization approach (FPO) to enhance the robustness of TTS systems. FPO shifts the optimization paradigm from global utterance-level tuning to targeted local refinement, addressing localized issues in generated samples rather than uniformly optimizing the entire utterance. We begin by analyzing common generation issues and categorizing them into temporal modeling errors and semantic-phonetic alignment errors, both of which frequently degrade intelligibility and naturalness. To tackle these problems, we introduce a selective training loss strategy that leverages fine-grained labels for each issue type, allowing the model to focus learning signals where they are most needed. Experimental results demonstrate that FPO substantially improves the robustness of zero-shot TTS systems by effectively correcting problematic regions in the output, leading to a significant reduction in the bad-case ratio and improvements in intelligibility and overall perceptual quality. Moreover, FPO exhibits strong data efficiency, achieving comparable or superior performance to baseline methods while requiring notably fewer training samples.
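
A selective training loss of the kind the abstract describes can be pictured as a masked, token-level variant of a pairwise preference objective: the preference margin is computed only over tokens inside segments flagged with fine-grained issue labels, so well-generated regions contribute no gradient. The sketch below is a minimal illustration under that assumption, not the paper's actual implementation; the DPO-style surrogate, the binary masking scheme, and all function and variable names are hypothetical.

```python
# Hypothetical sketch: a token-level, mask-restricted preference loss.
# Assumes per-token log-probabilities from the policy and a frozen
# reference model for a preferred (w) and dispreferred (l) sample, plus
# binary masks marking the flagged problematic segments.
import torch
import torch.nn.functional as F

def selective_preference_loss(
    policy_logp_w, policy_logp_l,  # per-token log-probs, shape (B, T)
    ref_logp_w, ref_logp_l,        # same shapes, frozen reference model
    mask_w, mask_l,                # 1.0 on tokens in flagged segments, else 0.0
    beta: float = 0.1,
) -> torch.Tensor:
    # Sum policy/reference log-ratios only over flagged tokens, so
    # well-generated regions receive no learning signal.
    ratio_w = ((policy_logp_w - ref_logp_w) * mask_w).sum(dim=-1)
    ratio_l = ((policy_logp_l - ref_logp_l) * mask_l).sum(dim=-1)
    # Standard Bradley-Terry / DPO surrogate on the masked margin.
    return -F.logsigmoid(beta * (ratio_w - ratio_l)).mean()

# Toy usage with random tensors standing in for model outputs.
B, T = 2, 8
policy_w = torch.randn(B, T, requires_grad=True)
policy_l = torch.randn(B, T, requires_grad=True)
ref_w, ref_l = torch.randn(B, T), torch.randn(B, T)
mask = torch.zeros(B, T)
mask[:, 3:6] = 1.0  # pretend tokens 3-5 were flagged as problematic
loss = selective_preference_loss(policy_w, policy_l, ref_w, ref_l, mask, mask)
loss.backward()
```

In this reading, separate masks per issue type (temporal vs. semantic-phonetic) could simply be combined or weighted before being passed in; the key property is that the gradient is confined to the flagged regions.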

Original language: English
Pages (from-to): 557-566
Number of pages: 10
Journal: IEEE Transactions on Audio, Speech and Language Processing
Volume: 34
State: Published - 2026

Keywords

  • Preference optimization
  • speech synthesis
  • text-to-speech

