Accent-VITS: Accent Transfer for End-to-End TTS

Linhan Ma, Yongmao Zhang, Xinfa Zhu, Yi Lei, Ziqian Ning, Pengcheng Zhu, Lei Xie

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Scopus citation

Abstract

Accent transfer aims to transfer an accent from a source speaker to synthetic speech in the target speaker’s voice. The main challenge is to effectively disentangle speaker timbre and accent, which are entangled in speech. This paper presents a VITS-based [7] end-to-end accent transfer model named Accent-VITS. Building on the main structure of VITS, Accent-VITS makes substantial improvements to enable effective and stable accent transfer. We leverage a hierarchical CVAE structure to model accent pronunciation information and acoustic features, using bottleneck features and mel spectrums as the respective constraints. Moreover, the text-to-wave mapping in VITS is decomposed into text-to-accent and accent-to-wave mappings in Accent-VITS. In this way, the disentanglement of accent and speaker timbre becomes more stable and effective. Experiments on multi-accent and Mandarin datasets show that Accent-VITS achieves higher speaker similarity, accent similarity, and speech naturalness than a strong baseline (Demos: https://anonymous-accentvits.github.io/AccentVITS/).
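The abstract does not state the training objective, so the following is only a rough illustration of what a two-level (hierarchical) CVAE usually involves: one KL regularizer per latent level, here one for an accent pronunciation latent and one for an acoustic latent. The sketch assumes diagonal Gaussian priors and posteriors, which is a simplification (VITS-style models typically add a normalizing flow to the prior), and every function and argument name below is hypothetical rather than taken from the paper.

```python
import math


def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """Closed-form KL(q || p) for diagonal Gaussians.

    Each argument is a list of floats with one entry per latent dimension;
    variances are passed as log-variances.
    """
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total


def hierarchical_kl(accent_post, accent_prior, acoustic_post, acoustic_prior):
    """Sum the KL terms of the two latent levels (assumed structure).

    Each argument is a (mu, logvar) pair. In this sketch the accent-level
    prior would come from text and the acoustic-level prior from the
    accent-level latent; both are simply passed in as numbers here.
    """
    kl_accent = kl_diag_gaussians(*accent_post, *accent_prior)
    kl_acoustic = kl_diag_gaussians(*acoustic_post, *acoustic_prior)
    return kl_accent + kl_acoustic


# Toy one-dimensional example: the accent posterior is shifted by 1 from its
# prior, while the acoustic posterior matches its prior exactly.
loss_kl = hierarchical_kl(
    accent_post=([1.0], [0.0]),
    accent_prior=([0.0], [0.0]),
    acoustic_post=([0.3], [0.1]),
    acoustic_prior=([0.3], [0.1]),
)
```

In a full model these KL terms would be combined with reconstruction losses on the bottleneck features and mel spectrums mentioned above; the weighting between them is not specified in the abstract.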

Original language: English
Title of host publication: Man-Machine Speech Communication - 18th National Conference, NCMMSC 2023, Proceedings
Editors: Jia Jia, Zhenhua Ling, Xie Chen, Ya Li, Zixing Zhang
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 203-214
Number of pages: 12
ISBN (Print): 9789819706006
DOIs
State: Published - 2024
Event: 18th National Conference on Man-Machine Speech Communication, NCMMSC 2023 - Suzhou, China
Duration: 8 Dec 2023 to 11 Dec 2023

Publication series

Name: Communications in Computer and Information Science
Volume: 2006
ISSN (Print): 1865-0929
ISSN (Electronic): 1865-0937

Conference

Conference: 18th National Conference on Man-Machine Speech Communication, NCMMSC 2023
Country/Territory: China
City: Suzhou
Period: 8/12/23 to 11/12/23

Keywords

  • Accent transfer
  • Hierarchical
  • Text to speech
  • Variational autoencoder
