A waveform representation framework for high-quality statistical parametric speech synthesis

Bo Fan, Siu Wa Lee, Xiaohai Tian, Lei Xie, Minghui Dong

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-reviewed

3 Scopus citations

Abstract

State-of-the-art statistical parametric speech synthesis (SPSS) generally uses a vocoder to represent speech signals and parameterize them into features for subsequent modeling. The magnitude spectrum has been the dominant feature over the years. Although perceptual studies have shown that the phase spectrum is essential to the quality of synthesized speech, it is often discarded by using a minimum-phase filter during synthesis, and speech quality suffers as a result. To bypass this bottleneck in vocoded speech, this paper proposes a phase-embedded waveform representation framework and establishes a magnitude-phase joint modeling platform for high-quality SPSS. Our experiments on waveform reconstruction show that the framework performs better than the widely used STRAIGHT vocoder. Furthermore, the proposed modeling and synthesis platform outperforms a leading-edge, vocoded, deep bidirectional long short-term memory recurrent neural network (DBLSTM-RNN)-based baseline system on various objective evaluation metrics.
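The abstract's point about minimum-phase synthesis can be illustrated with a small sketch. This is not the paper's method: it is the standard cepstral (homomorphic) minimum-phase reconstruction, written with NumPy, and the function name `minimum_phase_from_magnitude`, the toy frame, and the FFT size are all assumptions chosen for illustration. It shows that a frame rebuilt from its magnitude spectrum alone keeps the magnitude but not the original waveform.

```python
import numpy as np


def minimum_phase_from_magnitude(magnitude, eps=1e-12):
    """Rebuild a minimum-phase frame from a full-length FFT magnitude.

    Hypothetical helper for illustration; assumes an even FFT size and a
    magnitude spectrum taken from a real-valued frame.
    """
    n_fft = len(magnitude)
    # Real cepstrum: inverse FFT of the log-magnitude spectrum.
    cepstrum = np.fft.ifft(np.log(magnitude + eps)).real
    # Fold the cepstrum onto non-negative quefrencies (causal part only).
    fold = np.zeros(n_fft)
    fold[0] = 1.0
    fold[1:n_fft // 2] = 2.0
    fold[n_fft // 2] = 1.0
    # Exponentiating the FFT of the folded cepstrum gives a spectrum with
    # the same magnitude and a minimum-phase response.
    min_phase_spectrum = np.exp(np.fft.fft(cepstrum * fold))
    return np.fft.ifft(min_phase_spectrum).real


# Toy frame (an assumption, not speech data): a time-reversed decaying
# exponential, zero-padded to the FFT size. It is not minimum phase.
n_fft = 256
frame = np.zeros(n_fft)
frame[:64] = (0.9 ** np.arange(64))[::-1]

magnitude = np.abs(np.fft.fft(frame))
rebuilt = minimum_phase_from_magnitude(magnitude)

# The magnitude spectrum survives; the waveform (and its phase) does not.
magnitude_preserved = np.allclose(np.abs(np.fft.fft(rebuilt)), magnitude)
waveform_preserved = np.allclose(rebuilt, frame)
print(magnitude_preserved, waveform_preserved)
```

In this toy case the two frames share a magnitude spectrum yet differ sample by sample, which is the information a magnitude-only vocoder cannot recover.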

Original language: English
Title of host publication: 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2015
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 530-536
Number of pages: 7
ISBN (Electronic): 9789881476807
State: Published - 19 Feb 2016
Event: 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2015 - Hong Kong, Hong Kong
Duration: 16 Dec 2015 to 19 Dec 2015

Publication series

Name: 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2015

Conference

Conference: 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2015
Country/Territory: Hong Kong
City: Hong Kong
Period: 16/12/15 to 19/12/15
