Accent-VITS: Accent Transfer for End-to-End TTS

Linhan Ma; Yongmao Zhang; Xinfa Zhu; Yi Lei; Ziqian Ning; Pengcheng Zhu; Lei Xie

doi:10.1007/978-981-97-0601-3_17

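School of Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

2 Citations (Scopus)

Abstract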

Accent transfer aims to transfer an accent from a source speaker to synthetic speech in the target speaker’s voice. The main challenge is how to effectively disentangle speaker timbre and accent, which are entangled in speech. This paper presents a VITS-based [7] end-to-end accent transfer model named Accent-VITS. Based on the main structure of VITS, Accent-VITS makes substantial improvements to enable effective and stable accent transfer. We leverage a hierarchical CVAE structure to model accent pronunciation information and acoustic features, respectively, using bottleneck features and mel spectrums as constraints. Moreover, the text-to-wave mapping in VITS is decomposed into text-to-accent and accent-to-wave mappings in Accent-VITS. In this way, the disentanglement of accent and speaker timbre becomes more stable and effective. Experiments on multi-accent and Mandarin datasets show that Accent-VITS achieves higher speaker similarity, accent similarity and speech naturalness than a strong baseline (Demos: https://anonymous-accentvits.github.io/AccentVITS/).
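The abstract does not give the training objective, so the following is only a minimal sketch of one ingredient that CVAE-style models commonly use: a KL divergence between a posterior and a prior, summed over the two levels of a hierarchy. It assumes diagonal Gaussian distributions at both levels. The function names and the two-level split (a text-to-accent level and an accent-to-wave level) are hypothetical illustrations, not the authors' implementation.

```python
import math


def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """Closed-form KL(q || p) between two diagonal Gaussians.

    Each argument is a list of per-dimension values; variances are given
    as log-variances. Assumption: both distributions are diagonal Gaussians.
    """
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        # 0.5 * [log(var_p / var_q) + (var_q + (mu_q - mu_p)^2) / var_p - 1]
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total


def hierarchical_kl(accent_level, acoustic_level):
    """Sum the KL terms of a two-level hierarchy (hypothetical layout).

    Each level is a tuple (mu_q, logvar_q, mu_p, logvar_p). In this sketch
    the first level stands for the text-to-accent mapping and the second
    for the accent-to-wave mapping.
    """
    return kl_diag_gaussians(*accent_level) + kl_diag_gaussians(*acoustic_level)


if __name__ == "__main__":
    accent = ([1.0], [0.0], [0.0], [0.0])    # means differ by 1, unit variances
    acoustic = ([0.0], [0.0], [0.0], [0.0])  # identical distributions
    print(hierarchical_kl(accent, acoustic))
```

In a real system the posterior and prior statistics would come from neural encoders, and models in the VITS family also place a normalizing flow on the prior, so the KL term there is estimated from samples rather than computed in closed form as above.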

Original language	English
Title of host publication	Man-Machine Speech Communication - 18th National Conference, NCMMSC 2023, Proceedings
Editors	Jia Jia, Zhenhua Ling, Xie Chen, Ya Li, Zixing Zhang
Publisher	Springer Science and Business Media Deutschland GmbH
Pages	203-214
Number of pages	12
ISBN (Print)	9789819706006
DOI	https://doi.org/10.1007/978-981-97-0601-3_17
Publication status	Published - 2024
Event	18th National Conference on Man-Machine Speech Communication, NCMMSC 2023 - Suzhou, China. Duration: 8 Dec 2023 → 11 Dec 2023

Publication series

Name	Communications in Computer and Information Science
Volume	2006
ISSN (Print)	1865-0929
ISSN (Electronic)	1865-0937

Conference

Conference	18th National Conference on Man-Machine Speech Communication, NCMMSC 2023
Country/Territory	China
City	Suzhou
Period	8/12/23 → 11/12/23

Cite this

Ma, L., Zhang, Y., Zhu, X., Lei, Y., Ning, Z., Zhu, P., & Xie, L. (2024). Accent-VITS: Accent Transfer for End-to-End TTS. In J. Jia, Z. Ling, X. Chen, Y. Li, & Z. Zhang (Eds.), Man-Machine Speech Communication - 18th National Conference, NCMMSC 2023, Proceedings (pp. 203-214). (Communications in Computer and Information Science; Vol. 2006). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-981-97-0601-3_17

Ma, Linhan ; Zhang, Yongmao ; Zhu, Xinfa et al. / Accent-VITS : Accent Transfer for End-to-End TTS. Man-Machine Speech Communication - 18th National Conference, NCMMSC 2023, Proceedings. editor / Jia Jia ; Zhenhua Ling ; Xie Chen ; Ya Li ; Zixing Zhang. Springer Science and Business Media Deutschland GmbH, 2024. pp. 203-214 (Communications in Computer and Information Science).

@inproceedings{443c3d9da2dc4d75982b870eca147102,

title = "Accent-VITS: Accent Transfer for End-to-End TTS",

abstract = "Accent transfer aims to transfer an accent from a source speaker to synthetic speech in the target speaker{\textquoteright}s voice. The main challenge is how to effectively disentangle speaker timbre and accent, which are entangled in speech. This paper presents a VITS-based [7] end-to-end accent transfer model named Accent-VITS. Based on the main structure of VITS, Accent-VITS makes substantial improvements to enable effective and stable accent transfer. We leverage a hierarchical CVAE structure to model accent pronunciation information and acoustic features, respectively, using bottleneck features and mel spectrums as constraints. Moreover, the text-to-wave mapping in VITS is decomposed into text-to-accent and accent-to-wave mappings in Accent-VITS. In this way, the disentanglement of accent and speaker timbre becomes more stable and effective. Experiments on multi-accent and Mandarin datasets show that Accent-VITS achieves higher speaker similarity, accent similarity and speech naturalness than a strong baseline (Demos: https://anonymous-accentvits.github.io/AccentVITS/).",

keywords = "Accent transfer, Hierarchical, Text to speech, Variational autoencoder",

author = "Linhan Ma and Yongmao Zhang and Xinfa Zhu and Yi Lei and Ziqian Ning and Pengcheng Zhu and Lei Xie",

note = "Publisher Copyright: {\textcopyright} The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024.; 18th National Conference on Man-Machine Speech Communication, NCMMSC 2023 ; Conference date: 08-12-2023 Through 11-12-2023",

year = "2024",

doi = "10.1007/978-981-97-0601-3_17",

language = "English",

isbn = "9789819706006",

series = "Communications in Computer and Information Science",

publisher = "Springer Science and Business Media Deutschland GmbH",

pages = "203--214",

editor = "Jia Jia and Zhenhua Ling and Xie Chen and Ya Li and Zixing Zhang",

booktitle = "Man-Machine Speech Communication - 18th National Conference, NCMMSC 2023, Proceedings",

}

Ma, L, Zhang, Y, Zhu, X, Lei, Y, Ning, Z, Zhu, P & Xie, L 2024, Accent-VITS: Accent Transfer for End-to-End TTS. in J Jia, Z Ling, X Chen, Y Li & Z Zhang (eds), Man-Machine Speech Communication - 18th National Conference, NCMMSC 2023, Proceedings. Communications in Computer and Information Science, vol. 2006, Springer Science and Business Media Deutschland GmbH, pp. 203-214, 18th National Conference on Man-Machine Speech Communication, NCMMSC 2023, Suzhou, China, 8/12/23. https://doi.org/10.1007/978-981-97-0601-3_17

Accent-VITS: Accent Transfer for End-to-End TTS. / Ma, Linhan; Zhang, Yongmao; Zhu, Xinfa et al.
Man-Machine Speech Communication - 18th National Conference, NCMMSC 2023, Proceedings. ed. / Jia Jia; Zhenhua Ling; Xie Chen; Ya Li; Zixing Zhang. Springer Science and Business Media Deutschland GmbH, 2024. pp. 203-214 (Communications in Computer and Information Science; Vol. 2006).

TY - GEN

T1 - Accent-VITS

T2 - 18th National Conference on Man-Machine Speech Communication, NCMMSC 2023

AU - Ma, Linhan

AU - Zhang, Yongmao

AU - Zhu, Xinfa

AU - Lei, Yi

AU - Ning, Ziqian

AU - Zhu, Pengcheng

AU - Xie, Lei

N1 - Publisher Copyright: © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024.

PY - 2024

Y1 - 2024

AB - Accent transfer aims to transfer an accent from a source speaker to synthetic speech in the target speaker’s voice. The main challenge is how to effectively disentangle speaker timbre and accent, which are entangled in speech. This paper presents a VITS-based [7] end-to-end accent transfer model named Accent-VITS. Based on the main structure of VITS, Accent-VITS makes substantial improvements to enable effective and stable accent transfer. We leverage a hierarchical CVAE structure to model accent pronunciation information and acoustic features, respectively, using bottleneck features and mel spectrums as constraints. Moreover, the text-to-wave mapping in VITS is decomposed into text-to-accent and accent-to-wave mappings in Accent-VITS. In this way, the disentanglement of accent and speaker timbre becomes more stable and effective. Experiments on multi-accent and Mandarin datasets show that Accent-VITS achieves higher speaker similarity, accent similarity and speech naturalness than a strong baseline (Demos: https://anonymous-accentvits.github.io/AccentVITS/).

KW - Accent transfer

KW - Hierarchical

KW - Text to speech

KW - Variational autoencoder

UR - http://www.scopus.com/inward/record.url?scp=85186719628&partnerID=8YFLogxK

U2 - 10.1007/978-981-97-0601-3_17

DO - 10.1007/978-981-97-0601-3_17

M3 - Conference contribution

AN - SCOPUS:85186719628

SN - 9789819706006

T3 - Communications in Computer and Information Science

SP - 203

EP - 214

BT - Man-Machine Speech Communication - 18th National Conference, NCMMSC 2023, Proceedings

A2 - Jia, Jia

A2 - Ling, Zhenhua

A2 - Chen, Xie

A2 - Li, Ya

A2 - Zhang, Zixing

PB - Springer Science and Business Media Deutschland GmbH

Y2 - 8 December 2023 through 11 December 2023

ER -

Ma L, Zhang Y, Zhu X, Lei Y, Ning Z, Zhu P et al. Accent-VITS: Accent Transfer for End-to-End TTS. In Jia J, Ling Z, Chen X, Li Y, Zhang Z, editors, Man-Machine Speech Communication - 18th National Conference, NCMMSC 2023, Proceedings. Springer Science and Business Media Deutschland GmbH. 2024. p. 203-214. (Communications in Computer and Information Science). doi: 10.1007/978-981-97-0601-3_17