Conversational End-to-End TTS for Voice Agents

Haohan Guo, Shaofei Zhang, Frank K. Soong, Lei He, Lei Xie

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

51 Scopus citations

Abstract

End-to-end neural TTS has achieved excellent performance on reading style speech synthesis. However, it is still a challenge to build a high-quality conversational TTS due to the limitations of corpus and modeling capability. This study aims at building a conversational TTS for a voice agent under sequence to sequence modeling framework. We firstly construct a spontaneous conversational speech corpus well designed for the voice agent with a new recording scheme ensuring both recording quality and conversational speaking style. Secondly, we propose a conversation context-aware end-to-end TTS approach that employs an auxiliary encoder and a conversational context encoder to specifically reinforce the information about the current utterance and its context in a conversation as well. Experimental results show that the proposed approach produces more natural prosody in accordance with the conversational context, with significant preference gains at both utterance-level and conversation-level. Moreover, we find that the model has the ability to express some spontaneous behaviors like fillers and repeated words, which makes the conversational speaking style more realistic.

Original languageEnglish
Title of host publication2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Proceedings
PublisherInstitute of Electrical and Electronics Engineers Inc.
Pages403-409
Number of pages7
ISBN (Electronic)9781728170664
DOIs
StatePublished - 19 Jan 2021
Event2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Virtual, Shenzhen, China
Duration: 19 Jan 202122 Jan 2021

Publication series

Name2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Proceedings

Conference

Conference2021 IEEE Spoken Language Technology Workshop, SLT 2021
Country/TerritoryChina
CityVirtual, Shenzhen
Period19/01/2122/01/21

Keywords

  • Conversational TTS
  • End-to-End
  • Speech Corpus
  • Text-to-Speech
  • Voice Agent

Fingerprint

Dive into the research topics of 'Conversational End-to-End TTS for Voice Agents'. Together they form a unique fingerprint.

Cite this