
Llasa+: Free Lunch for Accelerated and Streaming Llama-Based Speech Synthesis

  • Wenjie Tian
  • Xinfa Zhu
  • Hanke Xie
  • Zhen Ye
  • Wei Xue
  • Lei Xie
  • Northwestern Polytechnical University Xian
  • Hong Kong University of Science and Technology

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-reviewed

Abstract

Recent progress in text-to-speech (TTS) has achieved impressive naturalness and flexibility, especially with the development of large language model (LLM)-based approaches. However, existing autoregressive (AR) structures and large-scale models, such as Llasa, still face significant challenges in inference latency and streaming synthesis. To address these limitations, we introduce Llasa+, an accelerated and streaming TTS model built on Llasa. Specifically, to accelerate the generation process, we introduce two plug-and-play Multi-Token Prediction (MTP) modules following the frozen backbone. These modules allow the model to predict multiple tokens in one AR step. Additionally, to mitigate potential error propagation caused by inaccurate MTP, we design a novel verification algorithm that leverages the frozen backbone to validate the generated tokens, thus allowing Llasa+ to achieve speedup without sacrificing generation quality. Furthermore, we design a causal decoder that enables streaming speech reconstruction from tokens. Extensive experiments show that Llasa+ achieves a 1.48× speedup without sacrificing generation quality, despite being trained only on LibriTTS. Moreover, the MTP-and-verification framework can be applied to accelerate any LLM-based model. All code and models are publicly available at https://github.com/ASLP-lab/LLaSA_Plus.
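The abstract describes drafting several tokens per AR step with MTP modules and then validating them with the frozen backbone. The sketch below illustrates that general draft-and-verify idea on a toy integer "language model". It is not the paper's verification algorithm, whose details are not given here: the greedy exact-match acceptance rule and every name in the code (`toy_backbone`, `toy_drafter`, `generate_with_verification`, `greedy_ar`) are assumptions made for illustration only.

```python
from typing import Callable, List, Sequence, Tuple

Token = int
Backbone = Callable[[Sequence[Token]], Token]       # prefix -> greedy next token
Drafter = Callable[[Sequence[Token]], List[Token]]  # prefix -> extra draft tokens


def toy_backbone(prefix: Sequence[Token]) -> Token:
    """Hypothetical stand-in for the frozen backbone (a deterministic rule)."""
    return (prefix[-1] * 3 + 1) % 11


def toy_drafter(prefix: Sequence[Token]) -> List[Token]:
    """Hypothetical stand-in for two MTP modules: cheap and sometimes wrong."""
    drafts: List[Token] = []
    cur = prefix[-1]
    for _ in range(2):
        cur = (cur * 3 + 1) % 11
        if cur % 4 == 0:          # inject a deliberate drafting error
            cur = (cur + 1) % 11
        drafts.append(cur)
    return drafts


def greedy_ar(backbone: Backbone, prompt: Sequence[Token], n: int) -> List[Token]:
    """Plain one-token-per-step decoding, used as the reference output."""
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(backbone(tokens))
    return tokens


def generate_with_verification(
    backbone: Backbone, drafter: Drafter, prompt: Sequence[Token], n: int
) -> Tuple[List[Token], int]:
    """Draft extra tokens each step; keep only those the backbone agrees with."""
    tokens = list(prompt)
    target_len = len(prompt) + n
    steps = 0
    while len(tokens) < target_len:
        steps += 1
        tokens.append(backbone(tokens))  # the backbone's own token is always kept
        for draft in drafter(tokens):
            if len(tokens) >= target_len:
                break
            # A real model would score all drafts in one batched forward pass;
            # this toy simply calls the backbone once per drafted position.
            if backbone(tokens) != draft:
                break                    # reject this draft and all later ones
            tokens.append(draft)
    return tokens, steps


if __name__ == "__main__":
    out, steps = generate_with_verification(toy_backbone, toy_drafter, [1], 12)
    print(out, steps)
```

Because a draft is accepted only when it equals the backbone's own greedy choice, the output matches plain AR decoding token for token, and wrong drafts cost nothing but a shorter accepted run. In this toy setup the 12 new tokens are produced in 5 steps rather than 12; the figure says nothing about the speedup of the real system.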

Original language: English
Title of host publication: ASRU 2025 - 2025 IEEE Automatic Speech Recognition and Understanding Workshop
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (electronic): 9798331544263
DOI
Publication status: Published - 2025
Event: 2025 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2025 - Honolulu, United States
Duration: 6 Dec 2025 to 10 Dec 2025

Publication series

Name: ASRU 2025 - 2025 IEEE Automatic Speech Recognition and Understanding Workshop

Conference

Conference: 2025 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2025
Country/Territory: United States
City: Honolulu
Period: 6/12/25 to 10/12/25
