Controlling Expressivity using Input Codes in Neural Network based TTS

Xiaolian Zhu, Lei Xie, Xiao Chen, Xiaoyan Lou, Xuan Zhu, Xingjun Tan

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

2 Scopus citations

Abstract

This paper presents a study on the use of input codes in neural network acoustic modeling for expressive TTS. Specifically, we combine different kinds of input codes with the linguistic features as the input of a BLSTM-based acoustic model to control the expressivity of the synthesized speech. The input codes, in one-hot representation, include a dialogue code, a sentiment code and a sentence position code. The dialogue code indicates whether the text is dialogue or narration in an audiobook story. The sentiment code is obtained from a sentiment analysis tool, which labels each sentence as positive, negative or neutral. The sentence position code indicates the position of the sentence in the paragraph. We believe these codes are highly related to the expressiveness of audiobook speech. Experiments on data from the Blizzard Challenge 2017 demonstrate the effectiveness of input codes in the neural network approach to expressive TTS.
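The input construction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the category inventories, the three-way bucketing of sentence position, the feature dimensions, and the choice to repeat the sentence-level codes on every frame are all assumptions made for illustration, and the function names are hypothetical.

```python
from typing import List, Sequence

# Hypothetical category inventories. The abstract names the three code types
# but does not give their exact values or dimensions.
DIALOGUE = ["narration", "dialogue"]
SENTIMENT = ["positive", "negative", "neutral"]
POSITION = ["first", "middle", "last"]  # assumed bucketing of sentence position


def one_hot(value: str, categories: Sequence[str]) -> List[float]:
    """Return a one-hot vector with 1.0 at the index of `value`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1.0 if c == value else 0.0 for c in categories]


def sentence_codes(dialogue: str, sentiment: str, position: str) -> List[float]:
    """Concatenate the three one-hot codes for one sentence."""
    return (one_hot(dialogue, DIALOGUE)
            + one_hot(sentiment, SENTIMENT)
            + one_hot(position, POSITION))


def augment_frames(linguistic_frames: Sequence[Sequence[float]],
                   codes: Sequence[float]) -> List[List[float]]:
    """Append the same sentence-level codes to every frame (an assumption)."""
    return [list(frame) + list(codes) for frame in linguistic_frames]


# Toy example: two frames with four made-up linguistic features each.
frames = [[0.1, 0.0, 1.0, 0.5], [0.2, 1.0, 0.0, 0.5]]
codes = sentence_codes("dialogue", "negative", "first")
model_input = augment_frames(frames, codes)
```

Under these assumptions, each row of `model_input` is the vector an acoustic model would receive for one frame. Changing a code at synthesis time (for example, switching the sentiment from "negative" to "positive") changes only the appended one-hot block and leaves the linguistic features untouched.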

Original language: English
Title of host publication: 2018 1st Asian Conference on Affective Computing and Intelligent Interaction, ACII Asia 2018
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781538653111
State: Published - 21 Sep 2018
Event: 1st Asian Conference on Affective Computing and Intelligent Interaction, ACII Asia 2018 - Beijing, China
Duration: 20 May 2018 to 22 May 2018

Publication series

Name: 2018 1st Asian Conference on Affective Computing and Intelligent Interaction, ACII Asia 2018

Conference

Conference: 1st Asian Conference on Affective Computing and Intelligent Interaction, ACII Asia 2018
Country/Territory: China
City: Beijing
Period: 20/05/18 to 22/05/18

Keywords

  • BLSTM
  • Neural network
  • Speech synthesis
  • Text-to-speech
