Multi-Band MelGAN: Faster Waveform Generation for High-Quality Text-to-Speech

Geng Yang, Shan Yang, Kai Liu, Peng Fang, Wei Chen, Lei Xie

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-reviewed

131 Citations (Scopus)

Abstract

In this paper, we propose multi-band MelGAN, a much faster waveform generation model targeting high-quality text-to-speech. Specifically, we improve the original MelGAN in the following aspects. First, we increase the receptive field of the generator, which is shown to be beneficial to speech generation. Second, we substitute the feature matching loss with the multi-resolution STFT loss to better measure the difference between fake and real speech. Together with pre-training, this improvement leads to both better quality and better training stability. More importantly, we extend MelGAN with multi-band processing: the generator takes mel-spectrograms as input and produces sub-band signals, which are subsequently summed back to full-band signals as discriminator input. The proposed multi-band MelGAN achieves high MOS of 4.34 and 4.22 in waveform generation and TTS, respectively. With only 1.91M parameters, our model effectively reduces the total computational complexity of the original MelGAN from 5.85 to 0.95 GFLOPS. Our PyTorch implementation can achieve a real-time factor of 0.03 on CPU without hardware-specific optimization.
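To put the efficiency figures in the abstract into perspective, the short sketch below works through the arithmetic. It assumes the usual definition of real-time factor (synthesis time divided by the duration of the audio produced), which the abstract does not spell out; the helper names `real_time_factor` and `reduction_ratio` are hypothetical and used only for illustration.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Assumed definition: time spent synthesizing / duration of audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def reduction_ratio(before: float, after: float) -> float:
    """How many times smaller `after` is than `before`."""
    if after <= 0:
        raise ValueError("after must be positive")
    return before / after


# Figures quoted in the abstract.
RTF_CPU = 0.03            # real-time factor on CPU
MELGAN_GFLOPS = 5.85      # original MelGAN
MB_MELGAN_GFLOPS = 0.95   # multi-band MelGAN

# Under the assumed definition, an RTF of 0.03 means one second of audio
# takes about 0.03 s to generate, i.e. roughly 33x faster than real time.
speed_vs_real_time = 1.0 / RTF_CPU

# Ratio of the two complexity figures reported in the abstract.
complexity_reduction = reduction_ratio(MELGAN_GFLOPS, MB_MELGAN_GFLOPS)

print(f"Speed relative to real time: {speed_vs_real_time:.1f}x")
print(f"Complexity reduction: {complexity_reduction:.2f}x")
```

Read this way, the reported numbers correspond to a roughly sixfold drop in computational cost relative to the original MelGAN.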

Original language: English
Title of host publication: 2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 492-498
Number of pages: 7
ISBN (electronic): 9781728170664
DOI
Publication status: Published - 19 Jan 2021
Event: 2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Virtual, Shenzhen, China
Duration: 19 Jan 2021 to 22 Jan 2021

Publication series

Name: 2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Proceedings

Conference

Conference: 2021 IEEE Spoken Language Technology Workshop, SLT 2021
Country/Territory: China
City: Virtual, Shenzhen
Period: 19/01/21 to 22/01/21
