UniStyle: Unified Style Modeling for Speaking Style Captioning and Stylistic Speech Synthesis

Xinfa Zhu, Wenjie Tian, Xinsheng Wang, Lei He, Yujia Xiao, Xi Wang, Xu Tan, Sheng Zhao, Lei Xie

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-reviewed

Abstract

Understanding the speaking style, such as the emotion of the interlocutor's speech, and responding with speech in an appropriate style is a natural occurrence in human conversations. However, technically, existing research on speech synthesis and speaking style captioning typically proceeds independently. In this work, an innovative framework, referred to as UniStyle, is proposed to incorporate both the capabilities of speaking style captioning and style-controllable speech synthesizing. Specifically, UniStyle consists of a UniConnector and a style prompt-based speech generator. The role of the UniConnector is to bridge the gap between different modalities, namely speech audio and text descriptions. It enables the generation of text descriptions with speech as input and the creation of style representations from text descriptions for speech synthesis with the speech generator. Besides, to overcome the issue of data scarcity, we propose a two-stage and semi-supervised training strategy, which reduces data requirements while boosting performance. Extensive experiments conducted on open-source corpora demonstrate that UniStyle achieves state-of-the-art performance in speaking style captioning and synthesizes expressive speech with various speaker timbres and speaking styles in a zero-shot manner.

Original language: English
Title of host publication: MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia
Publisher: Association for Computing Machinery, Inc
Pages: 7513-7522
Number of pages: 10
ISBN (Electronic): 9798400706868
DOI
Publication status: Published - 28 Oct 2024
Event: 32nd ACM International Conference on Multimedia, MM 2024 - Melbourne, Australia
Duration: 28 Oct 2024 to 1 Nov 2024

Publication series

Name: MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia

Conference

Conference: 32nd ACM International Conference on Multimedia, MM 2024
Country/Territory: Australia
City: Melbourne
Period: 28/10/24 to 1/11/24
