StreamFlow: Streaming Flow Matching with Block-Wise Guided Attention Mask for Speech Token Decoding

  • Dake Guo
  • Jixun Yao
  • Lihan Ma
  • He Wang
  • Lei Xie

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Recent advancements in discrete token-based speech generation have highlighted the importance of token-to-waveform generation for audio quality, particularly in real-time interactions. Traditional frameworks integrating semantic tokens with flow matching (FM) struggle with streaming capabilities due to their reliance on a global receptive field. Additionally, directly implementing token-by-token streaming speech generation often results in degraded audio quality. To address these challenges, we propose StreamFlow, a novel neural architecture that facilitates streaming flow matching with diffusion transformers (DiT). To mitigate the long-sequence extrapolation issues arising from lengthy historical dependencies, we design a local block-wise receptive field strategy. Specifically, the sequence is first segmented into blocks, and we introduce block-wise attention masks that enable the current block to receive information from the previous or subsequent block. These attention masks are combined hierarchically across different DiT-blocks to regulate the receptive field of DiTs. Both subjective and objective experimental results demonstrate that our approach achieves performance comparable to non-streaming methods while surpassing other streaming methods in terms of speech quality, all the while effectively managing inference time during long-sequence generation. Furthermore, our method achieves a notable first-packet latency of only 180 ms (Speech samples: https://dukguo.github.io/StreamFlow/).
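The block-wise masking idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the block size, the mask layout, or how masks are assigned to DiT blocks, so the function names (`block_attention_mask`, `receptive_field`), the parameters (`prev_blocks`, `next_blocks`), and the example sizes below are all hypothetical. The sketch only shows how a mask that lets each block see a neighbouring block can be built, and how stacking such masks across layers widens the overall receptive field.

```python
import numpy as np


def block_attention_mask(seq_len, block_size, prev_blocks=1, next_blocks=0):
    """Hypothetical helper: mask[i, j] is True when query frame i may attend to key frame j.

    Each frame may attend to its own block, up to `prev_blocks` earlier blocks,
    and up to `next_blocks` later blocks.
    """
    block_id = np.arange(seq_len) // block_size
    # Signed block offset from each query's block (rows) to each key's block (columns).
    offset = block_id[None, :] - block_id[:, None]
    return (offset >= -prev_blocks) & (offset <= next_blocks)


def receptive_field(masks):
    """Hypothetical helper: compose per-layer masks, given in first-to-last layer order.

    reach[i, k] is True when input frame k can influence output frame i
    after passing through all layers.
    """
    reach = np.eye(masks[0].shape[0], dtype=bool)
    for m in masks:
        reach = (m.astype(int) @ reach.astype(int)) > 0
    return reach


if __name__ == "__main__":
    # Assumed toy sizes: 8 frames in blocks of 2 frames each.
    backward = block_attention_mask(8, 2, prev_blocks=1, next_blocks=0)
    forward = block_attention_mask(8, 2, prev_blocks=0, next_blocks=1)
    # Two backward-looking layers followed by one layer with one block of lookahead.
    print(receptive_field([backward, backward, forward]).astype(int))
```

In this sketch each layer's receptive field stays local, and the total look-back and look-ahead grow with the number of stacked layers rather than with the sequence length, which is the property a streaming decoder needs to keep per-block cost bounded.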

Original language: English
Title of host publication: Man-Machine Speech Communication - 20th National Conference, NCMMSC 2025, Proceedings
Editors: Jia Jia, Zhiyong Wu, Lijian Gao, Gongping Huang, Ya Li
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 75-86
Number of pages: 12
ISBN (Print): 9789819553815
DOIs
State: Published - 2026
Event: 20th National Conference on Man-Machine Speech Communication, NCMMSC 2025 - Zhenjiang, China
Duration: 16 Oct 2025 to 19 Oct 2025

Publication series

Name: Communications in Computer and Information Science
Volume: 2662 CCIS
ISSN (Print): 1865-0929
ISSN (Electronic): 1865-0937

Conference

Conference: 20th National Conference on Man-Machine Speech Communication, NCMMSC 2025
Country/Territory: China
City: Zhenjiang
Period: 16/10/25 to 19/10/25

Keywords

  • block-wise attention mask
  • speech token decoding
  • streaming flow matching
