TY - GEN
T1 - UniStyle: Unified Style Modeling for Speaking Style Captioning and Stylistic Speech Synthesis
T2 - 32nd ACM International Conference on Multimedia, MM 2024
AU - Zhu, Xinfa
AU - Tian, Wenjie
AU - Wang, Xinsheng
AU - He, Lei
AU - Xiao, Yujia
AU - Wang, Xi
AU - Tan, Xu
AU - Zhao, Sheng
AU - Xie, Lei
N1 - Publisher Copyright:
© 2024 ACM.
PY - 2024/10/28
Y1 - 2024/10/28
AB - Understanding the speaking style, such as the emotion of the interlocutor's speech, and responding with speech in an appropriate style is a natural occurrence in human conversations. Technically, however, existing research on speech synthesis and on speaking style captioning has typically proceeded independently. In this work, we propose an innovative framework, referred to as UniStyle, that incorporates both speaking style captioning and style-controllable speech synthesis. Specifically, UniStyle consists of a UniConnector and a style prompt-based speech generator. The UniConnector bridges the gap between the two modalities, namely speech audio and text descriptions: it enables the generation of text descriptions from input speech and the creation of style representations from text descriptions for speech synthesis with the speech generator. In addition, to overcome data scarcity, we propose a two-stage, semi-supervised training strategy that reduces data requirements while boosting performance. Extensive experiments on open-source corpora demonstrate that UniStyle achieves state-of-the-art performance in speaking style captioning and synthesizes expressive speech with various speaker timbres and speaking styles in a zero-shot manner.
KW - data scarcity
KW - speaking style captioning
KW - style modeling
KW - text-to-speech
UR - http://www.scopus.com/inward/record.url?scp=85209822494&partnerID=8YFLogxK
U2 - 10.1145/3664647.3681465
DO - 10.1145/3664647.3681465
M3 - Conference contribution
AN - SCOPUS:85209822494
T3 - MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia
SP - 7513
EP - 7522
BT - MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia
PB - Association for Computing Machinery, Inc
Y2 - 28 October 2024 through 1 November 2024
ER -