TY - GEN
T1 - Multi-Band MelGAN
T2 - 2021 IEEE Spoken Language Technology Workshop, SLT 2021
AU - Yang, Geng
AU - Yang, Shan
AU - Liu, Kai
AU - Fang, Peng
AU - Chen, Wei
AU - Xie, Lei
N1 - Publisher Copyright:
© 2021 IEEE.
PY - 2021/1/19
Y1 - 2021/1/19
N2 - In this paper, we propose multi-band MelGAN, a much faster waveform generation model targeting high-quality text-to-speech. Specifically, we improve the original MelGAN in the following aspects. First, we increase the receptive field of the generator, which is proven to be beneficial to speech generation. Second, we substitute the feature matching loss with the multi-resolution STFT loss to better measure the difference between fake and real speech. Together with pre-training, this improvement leads to both better quality and better training stability. More importantly, we extend MelGAN with multi-band processing: the generator takes mel-spectrograms as input and produces sub-band signals, which are subsequently summed back to full-band signals as discriminator input. The proposed multi-band MelGAN achieves high MOS of 4.34 and 4.22 in waveform generation and TTS, respectively. With only 1.91M parameters, our model effectively reduces the total computational complexity of the original MelGAN from 5.85 to 0.95 GFLOPS. Our PyTorch implementation achieves a real-time factor of 0.03 on CPU without hardware-specific optimization.
KW - generative adversarial networks
KW - neural vocoder
KW - speech synthesis
KW - text-to-speech
UR - http://www.scopus.com/inward/record.url?scp=85102233443&partnerID=8YFLogxK
U2 - 10.1109/SLT48900.2021.9383551
DO - 10.1109/SLT48900.2021.9383551
M3 - Conference contribution
AN - SCOPUS:85102233443
T3 - 2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Proceedings
SP - 492
EP - 498
BT - 2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 19 January 2021 through 22 January 2021
ER -