PromptSpeaker: Speaker Generation Based on Text Descriptions

Yongmao Zhang; Guanghou Liu; Yi Lei; Yunlin Chen; Hao Yin; Lei Xie; Zhifei Li

doi:10.1109/ASRU57964.2023.10389772

School of Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

8 Scopus citations

Abstract

Recently, text-guided content generation has received extensive attention. In this work, we explore the possibility of text description-based speaker generation, i.e., using text prompts to control the speaker generation process. Specifically, we propose PromptSpeaker, a text-guided speaker generation system. PromptSpeaker consists of a prompt encoder, a zero-shot VITS, and a Glow model, where the prompt encoder predicts a prior distribution based on the text description and samples from this distribution to obtain a semantic representation. The Glow model subsequently converts the semantic representation into a speaker representation, and the zero-shot VITS finally synthesizes the speaker's voice based on the speaker representation. Objective metrics verify that PromptSpeaker can generate new speakers that differ from those in the training set, and the synthetic speaker voice has reasonable subjective matching quality with the speaker prompt. Our audio samples are available on the demo website: https://promptspeaker.github.io/demo/
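
The data flow described in the abstract (text description → prior distribution → sampled semantic representation → speaker representation) can be illustrated with a toy sketch. This is not the authors' implementation: `prompt_encoder`, `sample_semantic`, `flow_forward`, `flow_inverse`, and `generate_speaker` are hypothetical names, the "encoder" is a deterministic stand-in rather than a trained network, and a single elementwise affine map stands in for the Glow model. The zero-shot VITS stage, which would consume the speaker representation, is omitted.

```python
import math
import random


def prompt_encoder(description, dim=4):
    """Hypothetical stand-in for a trained prompt encoder.

    Returns the mean and log-variance of a diagonal Gaussian prior.
    The values are derived from character codes purely for illustration.
    """
    rng = random.Random(sum(ord(c) for c in description))
    mean = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    log_var = [rng.uniform(-2.0, 0.0) for _ in range(dim)]
    return mean, log_var


def sample_semantic(mean, log_var, rng):
    """Draw a semantic representation z ~ N(mean, exp(log_var))."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]


def flow_forward(z, scale, shift):
    """Toy invertible affine map (assumed stand-in for the Glow model)."""
    return [zi * math.exp(s) + b for zi, s, b in zip(z, scale, shift)]


def flow_inverse(e, scale, shift):
    """Exact inverse of flow_forward; invertibility is the key flow property."""
    return [(ei - b) * math.exp(-s) for ei, s, b in zip(e, scale, shift)]


def generate_speaker(description, rng, scale, shift):
    """Text description -> prior -> sampled semantic vector -> speaker vector."""
    mean, log_var = prompt_encoder(description, dim=len(scale))
    z = sample_semantic(mean, log_var, rng)
    return flow_forward(z, scale, shift)
```

Because the semantic representation is sampled rather than computed deterministically, calling `generate_speaker` twice with the same description and a shared random generator yields two different speaker vectors, which mirrors the idea of generating multiple new speakers from one prompt.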

Original language	English
Title of host publication	2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023
Publisher	Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic)	9798350306897
DOIs	https://doi.org/10.1109/ASRU57964.2023.10389772
State	Published - 2023
Event	2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023 - Taipei, Taiwan, Province of China
Duration	16 Dec 2023 → 20 Dec 2023

Publication series

Name	2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023

Conference

Conference	2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023
Country/Territory	Taiwan, Province of China
City	Taipei
Period	16/12/23 → 20/12/23

Keywords

Prompt
Speaker Generation
Text-to-Speech

Access to Document

10.1109/ASRU57964.2023.10389772

Cite this

Zhang, Y., Liu, G., Lei, Y., Chen, Y., Yin, H., Xie, L., & Li, Z. (2023). Promptspeaker: Speaker Generation Based on Text Descriptions. In 2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023 (2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ASRU57964.2023.10389772

@inproceedings{d9e1427e181240e3805e22961d28a232,

title = "Promptspeaker: Speaker Generation Based on Text Descriptions",

abstract = "Recently, text-guided content generation has received extensive attention. In this work, we explore the possibility of text description-based speaker generation, i.e., using text prompts to control the speaker generation process. Specifically, we propose PromptSpeaker, a text-guided speaker generation system. PromptSpeaker consists of a prompt encoder, a zero-shot VITS, and a Glow model, where the prompt encoder predicts a prior distribution based on the text description and samples from this distribution to obtain a semantic representation. The Glow model subsequently converts the semantic representation into a speaker representation, and the zero-shot VITS finally synthesizes the speaker's voice based on the speaker representation. We verify that PromptSpeaker can generate speakers new from the training set by objective metrics, and the synthetic speaker voice has reasonable subjective matching quality with the speaker prompt. Our audio samples are available on the demo website: https://promptspeaker.github.io/demo/",

keywords = "Prompt, Speaker Generation, Text-to-Speech",

author = "Yongmao Zhang and Guanghou Liu and Yi Lei and Yunlin Chen and Hao Yin and Lei Xie and Zhifei Li",

note = "Publisher Copyright: {\textcopyright} 2023 IEEE.; 2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023 ; Conference date: 16-12-2023 Through 20-12-2023",

year = "2023",

doi = "10.1109/ASRU57964.2023.10389772",

language = "English",

series = "2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

booktitle = "2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023",

}

Zhang, Y, Liu, G, Lei, Y, Chen, Y, Yin, H, Xie, L & Li, Z 2023, Promptspeaker: Speaker Generation Based on Text Descriptions. in 2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023. 2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023, Institute of Electrical and Electronics Engineers Inc., 2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023, Taipei, Taiwan, Province of China, 16/12/23. https://doi.org/10.1109/ASRU57964.2023.10389772

Promptspeaker: Speaker Generation Based on Text Descriptions. / Zhang, Yongmao; Liu, Guanghou; Lei, Yi et al.
2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023. Institute of Electrical and Electronics Engineers Inc., 2023. (2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023).


TY - GEN

T1 - Promptspeaker

T2 - 2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023

AU - Zhang, Yongmao

AU - Liu, Guanghou

AU - Lei, Yi

AU - Chen, Yunlin

AU - Yin, Hao

AU - Xie, Lei

AU - Li, Zhifei

PY - 2023

Y1 - 2023

N2 - Recently, text-guided content generation has received extensive attention. In this work, we explore the possibility of text description-based speaker generation, i.e., using text prompts to control the speaker generation process. Specifically, we propose PromptSpeaker, a text-guided speaker generation system. PromptSpeaker consists of a prompt encoder, a zero-shot VITS, and a Glow model, where the prompt encoder predicts a prior distribution based on the text description and samples from this distribution to obtain a semantic representation. The Glow model subsequently converts the semantic representation into a speaker representation, and the zero-shot VITS finally synthesizes the speaker's voice based on the speaker representation. We verify that PromptSpeaker can generate speakers new from the training set by objective metrics, and the synthetic speaker voice has reasonable subjective matching quality with the speaker prompt. Our audio samples are available on the demo website: https://promptspeaker.github.io/demo/

AB - Recently, text-guided content generation has received extensive attention. In this work, we explore the possibility of text description-based speaker generation, i.e., using text prompts to control the speaker generation process. Specifically, we propose PromptSpeaker, a text-guided speaker generation system. PromptSpeaker consists of a prompt encoder, a zero-shot VITS, and a Glow model, where the prompt encoder predicts a prior distribution based on the text description and samples from this distribution to obtain a semantic representation. The Glow model subsequently converts the semantic representation into a speaker representation, and the zero-shot VITS finally synthesizes the speaker's voice based on the speaker representation. We verify that PromptSpeaker can generate speakers new from the training set by objective metrics, and the synthetic speaker voice has reasonable subjective matching quality with the speaker prompt. Our audio samples are available on the demo website: https://promptspeaker.github.io/demo/

KW - Prompt

KW - Speaker Generation

KW - Text-to-Speech

UR - http://www.scopus.com/inward/record.url?scp=85184660751&partnerID=8YFLogxK

U2 - 10.1109/ASRU57964.2023.10389772

DO - 10.1109/ASRU57964.2023.10389772

M3 - Conference contribution

AN - SCOPUS:85184660751

T3 - 2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023

BT - 2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023

PB - Institute of Electrical and Electronics Engineers Inc.

Y2 - 16 December 2023 through 20 December 2023

ER -

Zhang Y, Liu G, Lei Y, Chen Y, Yin H, Xie L et al. Promptspeaker: Speaker Generation Based on Text Descriptions. In 2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023. Institute of Electrical and Electronics Engineers Inc. 2023. (2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023). doi: 10.1109/ASRU57964.2023.10389772
