Single-Codec: Single-Codebook Speech Codec towards High-Performance Speech Generation

Hanzhao Li, Liumeng Xue, Haohan Guo, Xinfa Zhu, Yuanjun Lv, Lei Xie, Yunlin Chen, Hao Yin, Zhifei Li

Research output: Conference article in journal, peer-reviewed

8 Citations (Scopus)

Abstract

The multi-codebook speech codec enables the application of large language models (LLMs) in TTS but bottlenecks efficiency and robustness due to multi-sequence prediction. To avoid this obstacle, we propose Single-Codec, a single-codebook, single-sequence codec, which employs a disentangled VQ-VAE to decouple speech into a time-invariant embedding and a phonetically-rich discrete sequence. Furthermore, the encoder is enhanced with 1) contextual modeling with a BLSTM module to exploit the temporal information, 2) a hybrid sampling module to alleviate distortion from upsampling and downsampling, and 3) a resampling module to encourage discrete units to carry more phonetic information. Compared with multi-codebook codecs, e.g., EnCodec and TiCodec, Single-Codec demonstrates higher reconstruction quality with a lower bandwidth of only 304 bps. The effectiveness of Single-Codec is further validated by LLM-TTS experiments, showing improved naturalness and intelligibility.
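The single-codebook quantization at the heart of such a codec can be illustrated with a minimal nearest-neighbor VQ sketch (a toy NumPy example; the function and variable names are hypothetical and this is not the authors' implementation, which adds the disentangled global embedding and the encoder modules described above):

```python
import numpy as np

def quantize(frames, codebook):
    """Nearest-neighbor vector quantization: map each encoder frame
    to the index of its closest codebook entry, yielding one discrete
    unit per frame -- a single token sequence, as in a single-codebook
    codec (vs. the parallel sequences of a multi-codebook RVQ)."""
    # Squared Euclidean distance between every frame and every code.
    d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = d.argmin(axis=1)            # one discrete id per frame
    return indices, codebook[indices]     # ids + quantized embeddings

# Toy example: 4 encoder frames, a codebook of 8 entries, 2-D embeddings.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 2))
frames = codebook[[3, 1, 3, 6]] + 0.01 * rng.normal(size=(4, 2))
ids, quantized = quantize(frames, codebook)
```

A decoder then reconstructs speech from the quantized embeddings (plus, in Single-Codec, the time-invariant utterance embedding); an LLM-based TTS model only has to predict the single `ids` sequence.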

Original language: English
Pages (from-to): 3390-3394
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
DOI
Publication status: Published - 2024
Event: 25th Interspeech Conference 2024 - Kos Island, Greece
Duration: 1 Sep 2024 - 5 Sep 2024
