Conversational End-to-End TTS for Voice Agents

Haohan Guo; Shaofei Zhang; Frank K. Soong; Lei He; Lei Xie

doi:10.1109/SLT48900.2021.9383460

School of Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

51 Citations (Scopus)

Abstract

End-to-end neural TTS has achieved excellent performance on reading style speech synthesis. However, it is still a challenge to build a high-quality conversational TTS due to the limitations of corpus and modeling capability. This study aims at building a conversational TTS for a voice agent under sequence to sequence modeling framework. We firstly construct a spontaneous conversational speech corpus well designed for the voice agent with a new recording scheme ensuring both recording quality and conversational speaking style. Secondly, we propose a conversation context-aware end-to-end TTS approach that employs an auxiliary encoder and a conversational context encoder to specifically reinforce the information about the current utterance and its context in a conversation as well. Experimental results show that the proposed approach produces more natural prosody in accordance with the conversational context, with significant preference gains at both utterance-level and conversation-level. Moreover, we find that the model has the ability to express some spontaneous behaviors like fillers and repeated words, which makes the conversational speaking style more realistic.

Original language	English
Title of host publication	2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Proceedings
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	403-409
Number of pages	7
ISBN (Electronic)	9781728170664
DOI	https://doi.org/10.1109/SLT48900.2021.9383460
Publication status	Published - 19 Jan 2021
Event	2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Virtual, Shenzhen, China. Duration: 19 Jan 2021 → 22 Jan 2021

Publication series

Name	2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Proceedings

Conference

Conference	2021 IEEE Spoken Language Technology Workshop, SLT 2021
Country/Territory	China
City	Virtual, Shenzhen
Period	19/01/21 → 22/01/21

Access to document

10.1109/SLT48900.2021.9383460

Cite this

Guo, H., Zhang, S., Soong, F. K., He, L., & Xie, L. (2021). Conversational End-to-End TTS for Voice Agents. In 2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Proceedings (pp. 403-409). Article 9383460 (2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Proceedings). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/SLT48900.2021.9383460

@inproceedings{d9971aaeceee4732b7781b838a6bd612,

title = "Conversational End-to-End TTS for Voice Agents",

abstract = "End-to-end neural TTS has achieved excellent performance on reading style speech synthesis. However, it is still a challenge to build a high-quality conversational TTS due to the limitations of corpus and modeling capability. This study aims at building a conversational TTS for a voice agent under sequence to sequence modeling framework. We firstly construct a spontaneous conversational speech corpus well designed for the voice agent with a new recording scheme ensuring both recording quality and conversational speaking style. Secondly, we propose a conversation context-aware end-to-end TTS approach that employs an auxiliary encoder and a conversational context encoder to specifically reinforce the information about the current utterance and its context in a conversation as well. Experimental results show that the proposed approach produces more natural prosody in accordance with the conversational context, with significant preference gains at both utterance-level and conversation-level. Moreover, we find that the model has the ability to express some spontaneous behaviors like fillers and repeated words, which makes the conversational speaking style more realistic.",

keywords = "Conversational TTS, End-to-End, Speech Corpus, Text-to-Speech, Voice Agent",

author = "Haohan Guo and Shaofei Zhang and Soong, {Frank K.} and Lei He and Lei Xie",

note = "Publisher Copyright: {\textcopyright} 2021 IEEE.; 2021 IEEE Spoken Language Technology Workshop, SLT 2021 ; Conference date: 19-01-2021 Through 22-01-2021",

year = "2021",

month = jan,

day = "19",

doi = "10.1109/SLT48900.2021.9383460",

language = "English",

series = "2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Proceedings",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

pages = "403--409",

booktitle = "2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Proceedings",

}

Guo, H, Zhang, S, Soong, FK, He, L & Xie, L 2021, Conversational End-to-End TTS for Voice Agents. in 2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Proceedings., 9383460, 2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Proceedings, Institute of Electrical and Electronics Engineers Inc., pp. 403-409, 2021 IEEE Spoken Language Technology Workshop, SLT 2021, Virtual, Shenzhen, China, 19/01/21. https://doi.org/10.1109/SLT48900.2021.9383460

Conversational End-to-End TTS for Voice Agents. / Guo, Haohan; Zhang, Shaofei; Soong, Frank K. et al.
2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Proceedings. Institute of Electrical and Electronics Engineers Inc., 2021. p. 403-409 9383460 (2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Proceedings).

TY - GEN

T1 - Conversational End-to-End TTS for Voice Agents

AU - Guo, Haohan

AU - Zhang, Shaofei

AU - Soong, Frank K.

AU - He, Lei

AU - Xie, Lei

PY - 2021/1/19

Y1 - 2021/1/19

N2 - End-to-end neural TTS has achieved excellent performance on reading style speech synthesis. However, it is still a challenge to build a high-quality conversational TTS due to the limitations of corpus and modeling capability. This study aims at building a conversational TTS for a voice agent under sequence to sequence modeling framework. We firstly construct a spontaneous conversational speech corpus well designed for the voice agent with a new recording scheme ensuring both recording quality and conversational speaking style. Secondly, we propose a conversation context-aware end-to-end TTS approach that employs an auxiliary encoder and a conversational context encoder to specifically reinforce the information about the current utterance and its context in a conversation as well. Experimental results show that the proposed approach produces more natural prosody in accordance with the conversational context, with significant preference gains at both utterance-level and conversation-level. Moreover, we find that the model has the ability to express some spontaneous behaviors like fillers and repeated words, which makes the conversational speaking style more realistic.

AB - End-to-end neural TTS has achieved excellent performance on reading style speech synthesis. However, it is still a challenge to build a high-quality conversational TTS due to the limitations of corpus and modeling capability. This study aims at building a conversational TTS for a voice agent under sequence to sequence modeling framework. We firstly construct a spontaneous conversational speech corpus well designed for the voice agent with a new recording scheme ensuring both recording quality and conversational speaking style. Secondly, we propose a conversation context-aware end-to-end TTS approach that employs an auxiliary encoder and a conversational context encoder to specifically reinforce the information about the current utterance and its context in a conversation as well. Experimental results show that the proposed approach produces more natural prosody in accordance with the conversational context, with significant preference gains at both utterance-level and conversation-level. Moreover, we find that the model has the ability to express some spontaneous behaviors like fillers and repeated words, which makes the conversational speaking style more realistic.

KW - Conversational TTS

KW - End-to-End

KW - Speech Corpus

KW - Text-to-Speech

KW - Voice Agent

UR - http://www.scopus.com/inward/record.url?scp=85103983737&partnerID=8YFLogxK

U2 - 10.1109/SLT48900.2021.9383460

DO - 10.1109/SLT48900.2021.9383460

M3 - Conference contribution

AN - SCOPUS:85103983737

T3 - 2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Proceedings

SP - 403

EP - 409

BT - 2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Proceedings

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - 2021 IEEE Spoken Language Technology Workshop, SLT 2021

Y2 - 19 January 2021 through 22 January 2021

ER -
