A comparison of expressive speech synthesis approaches based on neural network

Liumeng Xue; Xiaolian Zhu; Xiaochun An; Lei Xie

doi:10.1145/3267935.3267947

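School of Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Peer-reviewed

6 Citations (Scopus)

Abstract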

Adaptability and controllability in changing speaking styles and speaker characteristics are the advantages of deep neural networks (DNNs) based statistical parametric speech synthesis (SPSS). This paper presents a comprehensive study on the use of DNNs for expressive speech synthesis with a small set of emotional speech data. Specifically, we study three typical model adaptation approaches: (1) retraining a neural model by emotion-specific data (retrain), (2) augmenting the network input using emotion-specific codes (code) and (3) using emotion-dependent output layers with shared hidden layers (multi-head). Long-short term memory (LSTM) networks are used as the acoustic models. Objective and subjective evaluations have demonstrated that the multi-head approach consistently outperforms the other two approaches with more natural emotion delivered in the synthesized speech.
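
The "code" and "multi-head" approaches named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the emotion label set, the layer sizes, the random weights, and the helper names (`one_hot`, `augment_input`, `MultiHeadModel`) are all assumptions made for illustration, and a plain feed-forward layer stands in for the LSTM acoustic model to keep the example short. The "retrain" approach is not shown, since it amounts to continuing training of an existing model on emotion-specific data.

```python
import math
import random

# Hypothetical emotion label set; the abstract does not list the emotions used.
EMOTIONS = ["neutral", "happy", "sad"]


def one_hot(emotion):
    """Return a one-hot emotion code over the hypothetical label set."""
    return [1.0 if e == emotion else 0.0 for e in EMOTIONS]


def augment_input(linguistic_features, emotion):
    """'code' approach: append the emotion code to the network input."""
    return list(linguistic_features) + one_hot(emotion)


def make_layer(n_in, n_out, rng):
    """Random weight matrix (n_out rows of n_in weights), standing in for trained weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]


def dense(weights, x, activation=None):
    """Apply one fully connected layer to the vector x."""
    out = [sum(w * v for w, v in zip(row, x)) for row in weights]
    return [activation(o) for o in out] if activation else out


class MultiHeadModel:
    """'multi-head' approach: shared hidden layer, one output layer per emotion.

    Assumption: a feed-forward hidden layer replaces the LSTM for brevity.
    """

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        self.shared = make_layer(n_in, n_hidden, rng)
        self.heads = {e: make_layer(n_hidden, n_out, rng) for e in EMOTIONS}

    def forward(self, x, emotion):
        hidden = dense(self.shared, x, math.tanh)  # shared across all emotions
        return dense(self.heads[emotion], hidden)  # emotion-dependent output layer


frame = [0.2, -0.5, 1.0, 0.0]  # toy linguistic feature vector
coded = augment_input(frame, "happy")
model = MultiHeadModel(n_in=len(frame), n_hidden=8, n_out=3)
happy_out = model.forward(frame, "happy")
sad_out = model.forward(frame, "sad")
```

In this sketch the same input frame yields different acoustic outputs depending on which head is selected, while the hidden representation is shared; in the "code" variant a single network would instead receive the augmented vector `coded`.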

Original language	English
Title of host publication	ASMMC-MMAC 2018 - Proceedings of the Joint Workshop of the 4th Workshop on Affective Social Multimedia Computing and 1st Multi-Modal Affective Computing of Large-Scale Multimedia Data, Co-located with MM 2018
Publisher	Association for Computing Machinery, Inc
Pages	15-20
Number of pages	6
ISBN (Electronic)	9781450359856
DOI	https://doi.org/10.1145/3267935.3267947
Publication status	Published - 19 Oct 2018
Event	Joint Workshop of the 4th Workshop on Affective Social Multimedia Computing and 1st Multi-Modal Affective Computing of Large-Scale Multimedia Data Workshop, ASMMC-MMAC 2018 - Seoul, South Korea (Duration: 26 Oct 2018 → …)

Publication series

Name	ASMMC-MMAC 2018 - Proceedings of the Joint Workshop of the 4th Workshop on Affective Social Multimedia Computing and 1st Multi-Modal Affective Computing of Large-Scale Multimedia Data, Co-located with MM 2018

Conference

Conference	Joint Workshop of the 4th Workshop on Affective Social Multimedia Computing and 1st Multi-Modal Affective Computing of Large-Scale Multimedia Data Workshop, ASMMC-MMAC 2018
Country/Territory	South Korea
City	Seoul
Period	26/10/18 → …

Access to document

10.1145/3267935.3267947

Cite this

Xue, L., Zhu, X., An, X., & Xie, L. (2018). A comparison of expressive speech synthesis approaches based on neural network. In ASMMC-MMAC 2018 - Proceedings of the Joint Workshop of the 4th Workshop on Affective Social Multimedia Computing and 1st Multi-Modal Affective Computing of Large-Scale Multimedia Data, Co-located with MM 2018 (pp. 15-20). (ASMMC-MMAC 2018 - Proceedings of the Joint Workshop of the 4th Workshop on Affective Social Multimedia Computing and 1st Multi-Modal Affective Computing of Large-Scale Multimedia Data, Co-located with MM 2018). Association for Computing Machinery, Inc. https://doi.org/10.1145/3267935.3267947

Xue, Liumeng ; Zhu, Xiaolian ; An, Xiaochun et al. / A comparison of expressive speech synthesis approaches based on neural network. ASMMC-MMAC 2018 - Proceedings of the Joint Workshop of the 4th Workshop on Affective Social Multimedia Computing and 1st Multi-Modal Affective Computing of Large-Scale Multimedia Data, Co-located with MM 2018. Association for Computing Machinery, Inc, 2018. pp. 15-20 (ASMMC-MMAC 2018 - Proceedings of the Joint Workshop of the 4th Workshop on Affective Social Multimedia Computing and 1st Multi-Modal Affective Computing of Large-Scale Multimedia Data, Co-located with MM 2018).

@inproceedings{b5adcc09aa994e67b0911b068ad85f24,

title = "A comparison of expressive speech synthesis approaches based on neural network",

abstract = "Adaptability and controllability in changing speaking styles and speaker characteristics are the advantages of deep neural networks (DNNs) based statistical parametric speech synthesis (SPSS). This paper presents a comprehensive study on the use of DNNs for expressive speech synthesis with a small set of emotional speech data. Specifically, we study three typical model adaptation approaches: (1) retraining a neural model by emotion-specific data (retrain), (2) augmenting the network input using emotion-specific codes (code) and (3) using emotion-dependent output layers with shared hidden layers (multi-head). Long-short term memory (LSTM) networks are used as the acoustic models. Objective and subjective evaluations have demonstrated that the multi-head approach consistently outperforms the other two approaches with more natural emotion delivered in the synthesized speech.",

keywords = "Code, Expressive speech synthesis, Multi-head network, Neural networks, Retrain, Statistical parametric speech synthesis, Text-to-speech",

author = "Liumeng Xue and Xiaolian Zhu and Xiaochun An and Lei Xie",

note = "Publisher Copyright: {\textcopyright} 2018 Association for Computing Machinery.; Joint Workshop of the 4th Workshop on Affective Social Multimedia Computing and 1st Multi-Modal Affective Computing of Large-Scale Multimedia Data Workshop, ASMMC-MMAC 2018 ; Conference date: 26-10-2018",

year = "2018",

month = oct,

day = "19",

doi = "10.1145/3267935.3267947",

language = "English",

series = "ASMMC-MMAC 2018 - Proceedings of the Joint Workshop of the 4th Workshop on Affective Social Multimedia Computing and 1st Multi-Modal Affective Computing of Large-Scale Multimedia Data, Co-located with MM 2018",

publisher = "Association for Computing Machinery, Inc",

pages = "15--20",

booktitle = "ASMMC-MMAC 2018 - Proceedings of the Joint Workshop of the 4th Workshop on Affective Social Multimedia Computing and 1st Multi-Modal Affective Computing of Large-Scale Multimedia Data, Co-located with MM 2018",

}

Xue, L, Zhu, X, An, X & Xie, L 2018, A comparison of expressive speech synthesis approaches based on neural network. in ASMMC-MMAC 2018 - Proceedings of the Joint Workshop of the 4th Workshop on Affective Social Multimedia Computing and 1st Multi-Modal Affective Computing of Large-Scale Multimedia Data, Co-located with MM 2018. ASMMC-MMAC 2018 - Proceedings of the Joint Workshop of the 4th Workshop on Affective Social Multimedia Computing and 1st Multi-Modal Affective Computing of Large-Scale Multimedia Data, Co-located with MM 2018, Association for Computing Machinery, Inc, pp. 15-20, Joint Workshop of the 4th Workshop on Affective Social Multimedia Computing and 1st Multi-Modal Affective Computing of Large-Scale Multimedia Data Workshop, ASMMC-MMAC 2018, Seoul, South Korea, 26/10/18. https://doi.org/10.1145/3267935.3267947

A comparison of expressive speech synthesis approaches based on neural network. / Xue, Liumeng; Zhu, Xiaolian; An, Xiaochun et al.
ASMMC-MMAC 2018 - Proceedings of the Joint Workshop of the 4th Workshop on Affective Social Multimedia Computing and 1st Multi-Modal Affective Computing of Large-Scale Multimedia Data, Co-located with MM 2018. Association for Computing Machinery, Inc, 2018. pp. 15-20 (ASMMC-MMAC 2018 - Proceedings of the Joint Workshop of the 4th Workshop on Affective Social Multimedia Computing and 1st Multi-Modal Affective Computing of Large-Scale Multimedia Data, Co-located with MM 2018).

TY - GEN

T1 - A comparison of expressive speech synthesis approaches based on neural network

AU - Xue, Liumeng

AU - Zhu, Xiaolian

AU - An, Xiaochun

AU - Xie, Lei

PY - 2018/10/19

Y1 - 2018/10/19

N2 - Adaptability and controllability in changing speaking styles and speaker characteristics are the advantages of deep neural networks (DNNs) based statistical parametric speech synthesis (SPSS). This paper presents a comprehensive study on the use of DNNs for expressive speech synthesis with a small set of emotional speech data. Specifically, we study three typical model adaptation approaches: (1) retraining a neural model by emotion-specific data (retrain), (2) augmenting the network input using emotion-specific codes (code) and (3) using emotion-dependent output layers with shared hidden layers (multi-head). Long-short term memory (LSTM) networks are used as the acoustic models. Objective and subjective evaluations have demonstrated that the multi-head approach consistently outperforms the other two approaches with more natural emotion delivered in the synthesized speech.

AB - Adaptability and controllability in changing speaking styles and speaker characteristics are the advantages of deep neural networks (DNNs) based statistical parametric speech synthesis (SPSS). This paper presents a comprehensive study on the use of DNNs for expressive speech synthesis with a small set of emotional speech data. Specifically, we study three typical model adaptation approaches: (1) retraining a neural model by emotion-specific data (retrain), (2) augmenting the network input using emotion-specific codes (code) and (3) using emotion-dependent output layers with shared hidden layers (multi-head). Long-short term memory (LSTM) networks are used as the acoustic models. Objective and subjective evaluations have demonstrated that the multi-head approach consistently outperforms the other two approaches with more natural emotion delivered in the synthesized speech.

KW - Code

KW - Expressive speech synthesis

KW - Multi-head network

KW - Neural networks

KW - Retrain

KW - Statistical parametric speech synthesis

KW - Text-to-speech

UR - http://www.scopus.com/inward/record.url?scp=85061709851&partnerID=8YFLogxK

U2 - 10.1145/3267935.3267947

DO - 10.1145/3267935.3267947

M3 - Conference contribution

AN - SCOPUS:85061709851

T3 - ASMMC-MMAC 2018 - Proceedings of the Joint Workshop of the 4th Workshop on Affective Social Multimedia Computing and 1st Multi-Modal Affective Computing of Large-Scale Multimedia Data, Co-located with MM 2018

SP - 15

EP - 20

BT - ASMMC-MMAC 2018 - Proceedings of the Joint Workshop of the 4th Workshop on Affective Social Multimedia Computing and 1st Multi-Modal Affective Computing of Large-Scale Multimedia Data, Co-located with MM 2018

PB - Association for Computing Machinery, Inc

T2 - Joint Workshop of the 4th Workshop on Affective Social Multimedia Computing and 1st Multi-Modal Affective Computing of Large-Scale Multimedia Data Workshop, ASMMC-MMAC 2018

Y2 - 26 October 2018

ER -

Xue L, Zhu X, An X, Xie L. A comparison of expressive speech synthesis approaches based on neural network. In ASMMC-MMAC 2018 - Proceedings of the Joint Workshop of the 4th Workshop on Affective Social Multimedia Computing and 1st Multi-Modal Affective Computing of Large-Scale Multimedia Data, Co-located with MM 2018. Association for Computing Machinery, Inc. 2018. p. 15-20. (ASMMC-MMAC 2018 - Proceedings of the Joint Workshop of the 4th Workshop on Affective Social Multimedia Computing and 1st Multi-Modal Affective Computing of Large-Scale Multimedia Data, Co-located with MM 2018). doi: 10.1145/3267935.3267947
