@inproceedings{3a2494f8f12841f089dd46d8da428aca,
  title     = {Robust {MelGAN}: A Robust Universal Neural Vocoder for High-Fidelity {TTS}},
  abstract  = {In current two-stage neural text-to-speech (TTS) paradigm, it is ideal to have a universal neural vocoder, once trained, which is robust to imperfect mel-spectrogram predicted from the acoustic model. To this end, we propose Robust MelGAN vocoder by solving the original multi-band MelGAN's metallic sound problem and increasing its generalization ability. Specifically, we introduce a fine-grained network dropout strategy to the generator. With a specifically designed over-smooth handler which separates speech signal into periodic and aperiodic components, we only perform network dropout to the aperiodic components, which alleviates metallic sounding and maintains good speaker similarity. To further improve generalization ability, we introduce several data augmentation methods to augment fake data in the discriminator, including harmonic shift, harmonic noise and phase noise. Experiments show that Robust MelGAN can be used as a universal vocoder, significantly improving sound quality in TTS systems built on various types of data. Audio samples are available at https://RobustMelGAN.github.io/RobustMelGAN/},
  keywords  = {data augmentation, generative adversarial network, text-to-speech, universal vocoder},
  author    = {Kun Song and Jian Cong and Xinsheng Wang and Yongmao Zhang and Lei Xie and Ning Jiang and Haiying Wu},
  note      = {Publisher Copyright: {\textcopyright} 2022 IEEE.; 13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022 ; Conference date: 11-12-2022 Through 14-12-2022},
  year      = {2022},
  doi       = {10.1109/ISCSLP57327.2022.10038120},
  language  = {English},
  series    = {2022 13th International Symposium on Chinese Spoken Language Processing, ISCSLP 2022},
  publisher = {Institute of Electrical and Electronics Engineers Inc.},
  pages     = {71--75},
  editor    = {Lee, {Kong Aik} and Hung-yi Lee and Yanfeng Lu and Minghui Dong},
  booktitle = {2022 13th International Symposium on Chinese Spoken Language Processing, {ISCSLP} 2022},
}