Abstract
Language models (LMs) have recently shown superior performance in various speech generation tasks, demonstrating a powerful ability to model semantic context. Given the intrinsic similarity between speech generation and speech enhancement, harnessing semantic information is likewise advantageous for speech enhancement. In light of this, we propose SELM, a novel speech enhancement paradigm that integrates discrete tokens and leverages language models. SELM comprises three stages: encoding, modeling, and decoding. We transform continuous waveform signals into discrete tokens using pre-trained self-supervised learning (SSL) models and a k-means tokenizer. Language models then capture comprehensive contextual information within these tokens. Finally, a de-tokenizer and HiFi-GAN restore the tokens to enhanced speech. Experimental results demonstrate that SELM achieves comparable performance on objective metrics and superior subjective perception results. Our demos are available.
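To make the encoding stage concrete, the sketch below illustrates k-means tokenization of frame-level SSL-style features into discrete token IDs. It is a minimal illustration under stated assumptions: the feature dimension, cluster count, and randomly generated features are placeholders, not the paper's configuration, and a real system would extract features from a pre-trained SSL model rather than sampling them.

```python
# Illustrative sketch of the encode stage: continuous frame embeddings are
# mapped to discrete tokens with a k-means codebook. Dimensions and cluster
# count are assumptions for illustration only.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Stand-in for frame-level SSL features of shape (num_frames, feature_dim);
# in practice these would come from a pre-trained SSL model.
train_features = rng.normal(size=(2000, 768)).astype(np.float32)

# Fit the k-means tokenizer (codebook) on a pool of speech features.
kmeans = KMeans(n_clusters=300, random_state=0).fit(train_features)

# Tokenize an utterance: each frame becomes the index of its nearest centroid.
utterance_features = rng.normal(size=(150, 768)).astype(np.float32)
tokens = kmeans.predict(utterance_features)  # shape (150,), integer token IDs
print(tokens[:10])

# A language model would then predict clean-speech tokens from these noisy
# tokens, and a de-tokenizer plus a HiFi-GAN vocoder would reconstruct the
# enhanced waveform.
```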
Original language | English |
---|---|
Pages (from-to) | 11561-11565 |
Number of pages | 5 |
Journal | ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings |
DOIs | |
State | Published - 2024 |
Event | 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024, Seoul, Korea, Republic of, 14 Apr 2024 – 19 Apr 2024 |
Keywords
- generative model
- language models
- speech enhancement
- staged approach