TY - GEN
T1 - StreamFlow
T2 - 20th National Conference on Man-Machine Speech Communication, NCMMSC 2025
AU - Guo, Dake
AU - Yao, Jixun
AU - Ma, Lihan
AU - Wang, He
AU - Xie, Lei
N1 - Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2026.
PY - 2026
Y1 - 2026
AB - Recent advancements in discrete token-based speech generation have highlighted the importance of token-to-waveform generation for audio quality, particularly in real-time interactions. Traditional frameworks integrating semantic tokens with flow matching (FM) struggle with streaming capabilities due to their reliance on a global receptive field. Additionally, directly implementing token-by-token streaming speech generation often results in degraded audio quality. To address these challenges, we propose StreamFlow, a novel neural architecture that facilitates streaming flow matching with diffusion transformers (DiT). To mitigate the long-sequence extrapolation issues arising from lengthy historical dependencies, we design a local block-wise receptive field strategy. Specifically, the sequence is first segmented into blocks, and we introduce block-wise attention masks that enable the current block to receive information from the previous or subsequent block. These attention masks are combined hierarchically across different DiT-blocks to regulate the receptive field of DiTs. Both subjective and objective experimental results demonstrate that our approach achieves performance comparable to non-streaming methods while surpassing other streaming methods in terms of speech quality, all the while effectively managing inference time during long-sequence generation. Furthermore, our method achieves a notable first-packet latency of only 180 ms (Speech samples: https://dukguo.github.io/StreamFlow/).
KW - block-wise attention mask
KW - speech token decoding
KW - streaming flow matching
UR - https://www.scopus.com/pages/publications/105027938368
U2 - 10.1007/978-981-95-5382-2_7
DO - 10.1007/978-981-95-5382-2_7
M3 - Conference contribution
AN - SCOPUS:105027938368
SN - 9789819553815
T3 - Communications in Computer and Information Science
SP - 75
EP - 86
BT - Man-Machine Speech Communication - 20th National Conference, NCMMSC 2025, Proceedings
A2 - Jia, Jia
A2 - Wu, Zhiyong
A2 - Gao, Lijian
A2 - Huang, Gongping
A2 - Li, Ya
PB - Springer Science and Business Media Deutschland GmbH
Y2 - 16 October 2025 through 19 October 2025
ER -