Multi-Band MelGAN: Faster Waveform Generation for High-Quality Text-to-Speech

  • Geng Yang
  • Shan Yang
  • Kai Liu
  • Peng Fang
  • Wei Chen
  • Lei Xie
  • Northwestern Polytechnical University Xian
  • Sohu, Inc.

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

162 Scopus citations

Abstract

In this paper, we propose multi-band MelGAN, a much faster waveform generation model targeting high-quality text-to-speech. Specifically, we improve the original MelGAN in the following aspects. First, we increase the receptive field of the generator, which is proven to be beneficial to speech generation. Second, we substitute the feature matching loss with the multi-resolution STFT loss to better measure the difference between fake and real speech. Together with pre-training, this improvement leads to both better quality and better training stability. More importantly, we extend MelGAN with multi-band processing: the generator takes mel-spectrograms as input and produces sub-band signals, which are subsequently summed back to full-band signals as discriminator input. The proposed multi-band MelGAN achieves high MOS of 4.34 and 4.22 in waveform generation and TTS, respectively. With only 1.91M parameters, our model effectively reduces the total computational complexity of the original MelGAN from 5.85 to 0.95 GFLOPS. Our PyTorch implementation can achieve a real-time factor of 0.03 on CPU without hardware-specific optimization.
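To put the efficiency figures in the abstract into perspective, the following is a minimal sketch of the arithmetic behind them. It assumes the common definition of real-time factor (synthesis time divided by the duration of the generated audio), which the abstract itself does not spell out. The function names and the 10-second utterance are hypothetical and chosen only for illustration. The constants are the values quoted in the abstract.

```python
# Figures quoted in the abstract.
MELGAN_GFLOPS = 5.85      # original MelGAN
MB_MELGAN_GFLOPS = 0.95   # multi-band MelGAN
RTF_CPU = 0.03            # reported real-time factor on CPU


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Assumed definition: time spent generating / duration of generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def speedup_over_real_time(rtf: float) -> float:
    """Seconds of audio produced per second of compute (inverse of the RTF)."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


def synthesis_time(audio_seconds: float, rtf: float) -> float:
    """Compute time needed to generate `audio_seconds` of audio at a given RTF."""
    return audio_seconds * rtf


# Ratio between the two complexity figures from the abstract.
complexity_ratio = MELGAN_GFLOPS / MB_MELGAN_GFLOPS

if __name__ == "__main__":
    print(f"Complexity reduction: {complexity_ratio:.2f}x")
    print(f"Speed relative to real time: {speedup_over_real_time(RTF_CPU):.1f}x")
    # Hypothetical 10-second utterance, for illustration only.
    print(f"Time to synthesize 10 s of audio: {synthesis_time(10.0, RTF_CPU):.2f} s")
```

Under these assumptions, the reported numbers correspond to roughly a 6.2-fold reduction in computational complexity and to synthesis running about 33 times faster than real time on CPU.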

Original language: English
Title of host publication: 2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 492-498
Number of pages: 7
ISBN (Electronic): 9781728170664
State: Published - 19 Jan 2021
Event: 2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Virtual, Online, China
Duration: 19 Jan 2021 to 22 Jan 2021

Publication series

Name: 2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Proceedings

Conference

Conference: 2021 IEEE Spoken Language Technology Workshop, SLT 2021
Country/Territory: China
City: Virtual, Online
Period: 19/01/21 to 22/01/21

Keywords

  • generative adversarial networks
  • neural vocoder
  • speech synthesis
  • text-to-speech
