TY - GEN
T1 - DialoSpeech
T2 - 17th Asia Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2025
AU - Xie, Hanke
AU - Guo, Dake
AU - Wang, Chengyou
AU - Li, Yue
AU - Tian, Wenjie
AU - Zhu, Xinfa
AU - Wang, Xinsheng
AU - Li, Xiulin
AU - Miao, Guanqiong
AU - Liu, Bo
AU - Xie, Lei
N1 - Publisher Copyright: © 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - Recent advances in text-to-speech (TTS) synthesis, particularly those leveraging large language models (LLMs), have significantly improved expressiveness and naturalness. However, generating human-like, interactive dialogue speech remains challenging. Current systems are limited by the scarcity of dual-track data and by difficulties in achieving naturalness, contextual coherence, and interactional dynamics, such as turn-taking, overlapping speech, and speaker consistency, in multi-turn conversations. To address these challenges, we propose DialoSpeech, a dual-track architecture combining a large language model with Chunked Flow Matching for expressive, human-like dialogue speech synthesis. DialoSpeech generates natural multi-turn conversations with coherent speaker turns and natural overlaps, supporting Chinese, English, and cross-lingual speech synthesis. We introduce a data processing pipeline to construct dual-track dialogue datasets, facilitating scalable training and experimental validation. Experiments show that our model outperforms baselines, offering a solution for generating human-like spoken dialogues. Code and checkpoints will be publicly released. Audio samples are available at https://tiamojames.github.io/DialoSpeech/
AB - Recent advances in text-to-speech (TTS) synthesis, particularly those leveraging large language models (LLMs), have significantly improved expressiveness and naturalness. However, generating human-like, interactive dialogue speech remains challenging. Current systems are limited by the scarcity of dual-track data and by difficulties in achieving naturalness, contextual coherence, and interactional dynamics, such as turn-taking, overlapping speech, and speaker consistency, in multi-turn conversations. To address these challenges, we propose DialoSpeech, a dual-track architecture combining a large language model with Chunked Flow Matching for expressive, human-like dialogue speech synthesis. DialoSpeech generates natural multi-turn conversations with coherent speaker turns and natural overlaps, supporting Chinese, English, and cross-lingual speech synthesis. We introduce a data processing pipeline to construct dual-track dialogue datasets, facilitating scalable training and experimental validation. Experiments show that our model outperforms baselines, offering a solution for generating human-like spoken dialogues. Code and checkpoints will be publicly released. Audio samples are available at https://tiamojames.github.io/DialoSpeech/
KW - Dialogue Generation
KW - Flow Matching
KW - Language Models
UR - https://www.scopus.com/pages/publications/105030472335
U2 - 10.1109/APSIPAASC65261.2025.11249327
DO - 10.1109/APSIPAASC65261.2025.11249327
M3 - Conference contribution
AN - SCOPUS:105030472335
T3 - 2025 Asia Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2025
SP - 807
EP - 812
BT - 2025 Asia Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2025
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 22 October 2025 through 24 October 2025
ER -