Skip to main navigation Skip to search Skip to main content

KALL-E: Autoregressive Speech Synthesis with Next-Distribution Prediction

  • Kangxiang Xia
  • , Xinfa Zhu
  • , Jixun Yao
  • , Wenjie Tian
  • , Wenhao Li
  • , Lei Xie
  • Northwestern Polytechnical University Xian

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

Abstract

We introduce KALL-E, a novel autoregressive (AR) language model for text-to-speech (TTS) synthesis that operates by predicting the next distribution of continuous speech frames. Unlike existing methods, KALL-E directly models the continuous speech distribution conditioned on text, eliminating the need for any diffusion-based components. Specifically, we utilize a Flow-VAE to extract a continuous latent speech representation from waveforms, instead of relying on discrete speech tokens. A single AR Transformer is then trained to predict these continuous speech distributions from text, optimizing a Kullback–Leibler divergence loss as its objective. Experimental results demonstrate that KALL-E achieves superior speech synthesis quality and can even adapt to a target speaker from just a single sample. Importantly, KALL-E provides a more direct and effective approach for utilizing continuous speech representations in TTS.

Original languageEnglish
Title of host publicationProceedings of the AAAI Conference on Artificial Intelligence
EditorsSven Koenig, Chad Jenkins, Matthew E. Taylor
PublisherAssociation for the Advancement of Artificial Intelligence
Pages34016-34024
Number of pages9
Edition40
ISBN (Print)9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067
DOIs
StatePublished - 2026
Event40th AAAI Conference on Artificial Intelligence, AAAI 2026 - Singapore, Singapore
Duration: 20 Jan 202627 Jan 2026

Publication series

NameProceedings of the AAAI Conference on Artificial Intelligence
Number40
Volume40
ISSN (Print)2159-5399
ISSN (Electronic)2374-3468

Conference

Conference40th AAAI Conference on Artificial Intelligence, AAAI 2026
Country/TerritorySingapore
CitySingapore
Period20/01/2627/01/26

Fingerprint

Dive into the research topics of 'KALL-E: Autoregressive Speech Synthesis with Next-Distribution Prediction'. Together they form a unique fingerprint.

Cite this