TY - JOUR
T1 - SELM: Speech Enhancement Using Discrete Tokens and Language Models
T2 - 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024
AU - Wang, Ziqian
AU - Zhu, Xinfa
AU - Zhang, Zihan
AU - Lv, Yuan Jun
AU - Jiang, Ning
AU - Zhao, Guoqing
AU - Xie, Lei
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Language models (LMs) have recently shown superior performance in various speech generation tasks, demonstrating their powerful ability for semantic context modeling. Given the intrinsic similarity between speech generation and speech enhancement, harnessing semantic information is advantageous for speech enhancement tasks. In light of this, we propose SELM, a novel speech enhancement paradigm that integrates discrete tokens and leverages language models. SELM comprises three stages: encoding, modeling, and decoding. We transform continuous waveform signals into discrete tokens using pre-trained self-supervised learning (SSL) models and a k-means tokenizer. Language models then capture comprehensive contextual information within these tokens. Finally, a de-tokenizer and HiFi-GAN restore them into enhanced speech. Experimental results demonstrate that SELM achieves comparable performance in objective metrics and superior subjective perception results. Our demos are available.
KW - generative model
KW - language models
KW - speech enhancement
KW - staged approach
UR - http://www.scopus.com/inward/record.url?scp=105001508330&partnerID=8YFLogxK
U2 - 10.1109/ICASSP48485.2024.10447464
DO - 10.1109/ICASSP48485.2024.10447464
M3 - Conference article
AN - SCOPUS:105001508330
SN - 1520-6149
SP - 11561
EP - 11565
JO - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
JF - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Y2 - 14 April 2024 through 19 April 2024
ER -