A waveform representation framework for high-quality statistical parametric speech synthesis

Bo Fan; Siu Wa Lee; Xiaohai Tian; Lei Xie; Minghui Dong

doi:10.1109/APSIPA.2015.7415327

School of Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

3 Scopus citations

Abstract

State-of-the-art statistical parametric speech synthesis (SPSS) generally uses a vocoder to represent speech signals and parameterize them into features for subsequent modeling. The magnitude spectrum has been the dominant feature over the years. Although perceptual studies have shown that the phase spectrum is essential to the quality of synthesized speech, it is often discarded by using a minimum-phase filter during synthesis, and the speech quality suffers. To bypass this bottleneck in vocoded speech, this paper proposes a phase-embedded waveform representation framework and establishes a magnitude-phase joint modeling platform for high-quality SPSS. Our experiments on waveform reconstruction show that the framework performs better than the widely used STRAIGHT vocoder. Furthermore, the proposed modeling and synthesis platform outperforms a leading-edge, vocoded, deep bidirectional long short-term memory recurrent neural network (DBLSTM-RNN)-based baseline system on various objective evaluation metrics.

Original language	English
Title of host publication	2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2015
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	530-536
Number of pages	7
ISBN (Electronic)	9789881476807
DOIs	https://doi.org/10.1109/APSIPA.2015.7415327
State	Published - 19 Feb 2016
Event	2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2015 - Hong Kong, Hong Kong
Duration	16 Dec 2015 → 19 Dec 2015

Cite this

Fan, B., Lee, S. W., Tian, X., Xie, L., & Dong, M. (2016). A waveform representation framework for high-quality statistical parametric speech synthesis. In 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2015 (pp. 530-536). Article 7415327 (2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2015). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/APSIPA.2015.7415327

Fan, Bo ; Lee, Siu Wa ; Tian, Xiaohai et al. / A waveform representation framework for high-quality statistical parametric speech synthesis. 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2015. Institute of Electrical and Electronics Engineers Inc., 2016. pp. 530-536 (2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2015).

@inproceedings{ea812decd4fc430080f3e3b657deee2f,

title = "A waveform representation framework for high-quality statistical parametric speech synthesis",

abstract = "State-of-the-art statistical parametric speech synthesis (SPSS) generally uses a vocoder to represent speech signals and parameterize them into features for subsequent modeling. Magnitude spectrum has been a dominant feature over the years. Although perceptual studies have shown that phase spectrum is essential to the quality of synthesized speech, it is often ignored by using a minimum phase filter during synthesis and the speech quality suffers. To bypass this bottleneck in vocoded speech, this paper proposes a phase-embedded waveform representation framework and establishes a magnitude-phase joint modeling platform for high-quality SPSS. Our experiments on waveform reconstruction show that the performance is better than that of the widely-used STRAIGHT. Furthermore, the proposed modeling and synthesis platform outperforms a leading-edge, vocoded, deep bidirectional long short-term memory recurrent neural network (DBLSTM-RNN)-based baseline system in various objective evaluation metrics conducted.",

author = "Bo Fan and Lee, {Siu Wa} and Xiaohai Tian and Lei Xie and Minghui Dong",

note = "Publisher Copyright: {\textcopyright} 2015 Asia-Pacific Signal and Information Processing Association.; 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2015 ; Conference date: 16-12-2015 Through 19-12-2015",

year = "2016",

month = feb,

day = "19",

doi = "10.1109/APSIPA.2015.7415327",

language = "English",

series = "2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2015",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

pages = "530--536",

booktitle = "2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2015",

}

Fan, B, Lee, SW, Tian, X, Xie, L & Dong, M 2016, A waveform representation framework for high-quality statistical parametric speech synthesis. in 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2015., 7415327, 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2015, Institute of Electrical and Electronics Engineers Inc., pp. 530-536, 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2015, Hong Kong, Hong Kong, 16/12/15. https://doi.org/10.1109/APSIPA.2015.7415327

A waveform representation framework for high-quality statistical parametric speech synthesis. / Fan, Bo; Lee, Siu Wa; Tian, Xiaohai et al.
2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2015. Institute of Electrical and Electronics Engineers Inc., 2016. p. 530-536 7415327 (2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2015).

TY - GEN

T1 - A waveform representation framework for high-quality statistical parametric speech synthesis

AU - Fan, Bo

AU - Lee, Siu Wa

AU - Tian, Xiaohai

AU - Xie, Lei

AU - Dong, Minghui

PY - 2016/2/19

Y1 - 2016/2/19

N2 - State-of-the-art statistical parametric speech synthesis (SPSS) generally uses a vocoder to represent speech signals and parameterize them into features for subsequent modeling. Magnitude spectrum has been a dominant feature over the years. Although perceptual studies have shown that phase spectrum is essential to the quality of synthesized speech, it is often ignored by using a minimum phase filter during synthesis and the speech quality suffers. To bypass this bottleneck in vocoded speech, this paper proposes a phase-embedded waveform representation framework and establishes a magnitude-phase joint modeling platform for high-quality SPSS. Our experiments on waveform reconstruction show that the performance is better than that of the widely-used STRAIGHT. Furthermore, the proposed modeling and synthesis platform outperforms a leading-edge, vocoded, deep bidirectional long short-term memory recurrent neural network (DBLSTM-RNN)-based baseline system in various objective evaluation metrics conducted.

UR - http://www.scopus.com/inward/record.url?scp=84986212974&partnerID=8YFLogxK

U2 - 10.1109/APSIPA.2015.7415327

DO - 10.1109/APSIPA.2015.7415327

M3 - Conference contribution

AN - SCOPUS:84986212974

T3 - 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2015

SP - 530

EP - 536

BT - 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2015

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2015

Y2 - 16 December 2015 through 19 December 2015

ER -

Fan B, Lee SW, Tian X, Xie L, Dong M. A waveform representation framework for high-quality statistical parametric speech synthesis. In 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2015. Institute of Electrical and Electronics Engineers Inc. 2016. p. 530-536. 7415327. (2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2015). doi: 10.1109/APSIPA.2015.7415327
