StreamVoice+: Evolving Into End-to-End Streaming Zero-Shot Voice Conversion

Zhichao Wang; Yuanzhe Chen; Xinsheng Wang; Lei Xie; Yuping Wang

doi:10.1109/LSP.2024.3483012

StreamVoice+: Evolving Into End-to-End Streaming Zero-Shot Voice Conversion

Zhichao Wang, Yuanzhe Chen, Xinsheng Wang, Lei Xie, Yuping Wang

计算机学院

科研成果: 期刊稿件 › 文章 › 同行评审

摘要

StreamVoice has recently pushed the boundaries of zero-shot voice conversion (VC) in the streaming domain. It uses a streamable language model (LM) with a context-aware approach to convert semantic features from automatic speech recognition (ASR) into acoustic features with the desired speaker timbre. Despite its innovations, StreamVoice faces challenges due to its dependency on a streaming ASR within a cascaded framework, which complicates system deployment and optimization, affects VC system's design and performance based on the choice of ASR, and struggles with conversion stability when faced with low-quality semantic inputs. To overcome these limitations, we introduce StreamVoice+, an enhanced LM-based end-to-end streaming framework that operates independently of streaming ASR. StreamVoice+ integrates a semantic encoder and a connector with the original StreamVoice framework, now trained using a non-streaming ASR. This model undergoes a two-stage training process: initially, the StreamVoice backbone is pre-trained for voice conversion and the semantic encoder for robust semantic extraction. Subsequently, the system is fine-tuned end-to-end, incorporating a LoRA matrix to activate comprehensive streaming functionality. Furthermore, StreamVoice+ mainly introduces two strategic enhancements to boost conversion quality: a residual compensation mechanism in the connector to ensure effective semantic transmission and a self-refinement strategy that leverages pseudo-parallel speech pairs generated by the conversion backbone to improve speech decoupling. Experiments demonstrate that StreamVoice+ not only achieves higher naturalness and speaker similarity in voice conversion than its predecessor but also provides versatile support for both streaming and non-streaming conversion scenarios.

源语言	英语
页（从-至）	3000-3004
页数	5
期刊	IEEE Signal Processing Letters
卷	31
DOI	https://doi.org/10.1109/LSP.2024.3483012
出版状态	已出版 - 2024

访问文件

10.1109/LSP.2024.3483012

其它文件与链接

链接到 Scopus 的出版物

引用此

@article{bd6558936c84407ab01e4f21b519ad9f,

title = "StreamVoice+: Evolving Into End-to-End Streaming Zero-Shot Voice Conversion",

abstract = "StreamVoice has recently pushed the boundaries of zero-shot voice conversion (VC) in the streaming domain. It uses a streamable language model (LM) with a context-aware approach to convert semantic features from automatic speech recognition (ASR) into acoustic features with the desired speaker timbre. Despite its innovations, StreamVoice faces challenges due to its dependency on a streaming ASR within a cascaded framework, which complicates system deployment and optimization, affects VC system's design and performance based on the choice of ASR, and struggles with conversion stability when faced with low-quality semantic inputs. To overcome these limitations, we introduce StreamVoice+, an enhanced LM-based end-to-end streaming framework that operates independently of streaming ASR. StreamVoice+ integrates a semantic encoder and a connector with the original StreamVoice framework, now trained using a non-streaming ASR. This model undergoes a two-stage training process: initially, the StreamVoice backbone is pre-trained for voice conversion and the semantic encoder for robust semantic extraction. Subsequently, the system is fine-tuned end-to-end, incorporating a LoRA matrix to activate comprehensive streaming functionality. Furthermore, StreamVoice+ mainly introduces two strategic enhancements to boost conversion quality: a residual compensation mechanism in the connector to ensure effective semantic transmission and a self-refinement strategy that leverages pseudo-parallel speech pairs generated by the conversion backbone to improve speech decoupling. Experiments demonstrate that StreamVoice+ not only achieves higher naturalness and speaker similarity in voice conversion than its predecessor but also provides versatile support for both streaming and non-streaming conversion scenarios.",

keywords = "Streaming voice conversion, end-to-end, language model, parameter-efficient fine-tuning, zero-shot",

author = "Zhichao Wang and Yuanzhe Chen and Xinsheng Wang and Lei Xie and Yuping Wang",

note = "Publisher Copyright: {\textcopyright} 1994-2012 IEEE.",

year = "2024",

doi = "10.1109/LSP.2024.3483012",

language = "英语",

volume = "31",

pages = "3000--3004",

journal = "IEEE Signal Processing Letters",

issn = "1070-9908",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - JOUR

T1 - StreamVoice+

T2 - Evolving Into End-to-End Streaming Zero-Shot Voice Conversion

AU - Wang, Zhichao

AU - Chen, Yuanzhe

AU - Wang, Xinsheng

AU - Xie, Lei

AU - Wang, Yuping

PY - 2024

Y1 - 2024

N2 - StreamVoice has recently pushed the boundaries of zero-shot voice conversion (VC) in the streaming domain. It uses a streamable language model (LM) with a context-aware approach to convert semantic features from automatic speech recognition (ASR) into acoustic features with the desired speaker timbre. Despite its innovations, StreamVoice faces challenges due to its dependency on a streaming ASR within a cascaded framework, which complicates system deployment and optimization, affects VC system's design and performance based on the choice of ASR, and struggles with conversion stability when faced with low-quality semantic inputs. To overcome these limitations, we introduce StreamVoice+, an enhanced LM-based end-to-end streaming framework that operates independently of streaming ASR. StreamVoice+ integrates a semantic encoder and a connector with the original StreamVoice framework, now trained using a non-streaming ASR. This model undergoes a two-stage training process: initially, the StreamVoice backbone is pre-trained for voice conversion and the semantic encoder for robust semantic extraction. Subsequently, the system is fine-tuned end-to-end, incorporating a LoRA matrix to activate comprehensive streaming functionality. Furthermore, StreamVoice+ mainly introduces two strategic enhancements to boost conversion quality: a residual compensation mechanism in the connector to ensure effective semantic transmission and a self-refinement strategy that leverages pseudo-parallel speech pairs generated by the conversion backbone to improve speech decoupling. Experiments demonstrate that StreamVoice+ not only achieves higher naturalness and speaker similarity in voice conversion than its predecessor but also provides versatile support for both streaming and non-streaming conversion scenarios.

AB - StreamVoice has recently pushed the boundaries of zero-shot voice conversion (VC) in the streaming domain. It uses a streamable language model (LM) with a context-aware approach to convert semantic features from automatic speech recognition (ASR) into acoustic features with the desired speaker timbre. Despite its innovations, StreamVoice faces challenges due to its dependency on a streaming ASR within a cascaded framework, which complicates system deployment and optimization, affects VC system's design and performance based on the choice of ASR, and struggles with conversion stability when faced with low-quality semantic inputs. To overcome these limitations, we introduce StreamVoice+, an enhanced LM-based end-to-end streaming framework that operates independently of streaming ASR. StreamVoice+ integrates a semantic encoder and a connector with the original StreamVoice framework, now trained using a non-streaming ASR. This model undergoes a two-stage training process: initially, the StreamVoice backbone is pre-trained for voice conversion and the semantic encoder for robust semantic extraction. Subsequently, the system is fine-tuned end-to-end, incorporating a LoRA matrix to activate comprehensive streaming functionality. Furthermore, StreamVoice+ mainly introduces two strategic enhancements to boost conversion quality: a residual compensation mechanism in the connector to ensure effective semantic transmission and a self-refinement strategy that leverages pseudo-parallel speech pairs generated by the conversion backbone to improve speech decoupling. Experiments demonstrate that StreamVoice+ not only achieves higher naturalness and speaker similarity in voice conversion than its predecessor but also provides versatile support for both streaming and non-streaming conversion scenarios.

KW - Streaming voice conversion

KW - end-to-end

KW - language model

KW - parameter-efficient fine-tuning

KW - zero-shot

UR - http://www.scopus.com/inward/record.url?scp=85207704882&partnerID=8YFLogxK

U2 - 10.1109/LSP.2024.3483012

DO - 10.1109/LSP.2024.3483012

M3 - 文章

AN - SCOPUS:85207704882

SN - 1070-9908

VL - 31

SP - 3000

EP - 3004

JO - IEEE Signal Processing Letters

JF - IEEE Signal Processing Letters

ER -

StreamVoice+: Evolving Into End-to-End Streaming Zero-Shot Voice Conversion

摘要

访问文件

其它文件与链接

指纹

引用此