DUALVC 2: DYNAMIC MASKED CONVOLUTION FOR UNIFIED STREAMING AND NON-STREAMING VOICE CONVERSION

Ziqian Ning; Yuepeng Jiang; Pengcheng Zhu; Shuai Wang; Jixun Yao; Lei Xie; Mengxiao Bi

doi:10.1109/ICASSP48485.2024.10446229

School of Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

5 Scopus citations

Abstract

Voice conversion is becoming increasingly popular, and a growing number of application scenarios require models with streaming inference capabilities. The recently proposed DualVC attempts to achieve this objective through streaming model architecture design and intra-model knowledge distillation, along with hybrid predictive coding to compensate for the lack of future information. However, DualVC encounters several problems that limit its performance. First, the autoregressive decoder is inherently prone to error accumulation and also limits inference speed. Second, the causal convolution enables streaming capability but cannot sufficiently use future information within chunks. Third, the model is unable to effectively handle noise in unvoiced segments, which lowers sound quality. In this paper, we propose DualVC 2 to address these issues. Specifically, the model backbone is migrated to a Conformer-based architecture, enabling parallel inference. Causal convolution is replaced by non-causal convolution with a dynamic chunk mask to make better use of within-chunk future information. Also, quiet attention is introduced to enhance the model's noise robustness. Experiments show that DualVC 2 outperforms DualVC and other baseline systems in both subjective and objective metrics, with only 186.4 ms latency. Our audio samples are made publicly available.
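The two mechanisms named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation, and the abstract gives no formulas, so everything here is an assumption: the function names `chunk_masked_conv1d` and `quiet_softmax` are hypothetical, the chunk mask is assumed to hide only frames beyond the end of the current chunk, and quiet attention is assumed to mean a softmax with an extra 1 in the denominator, so that attention weights may sum to less than one.

```python
import math
from typing import List


def chunk_masked_conv1d(x: List[float], kernel: List[float], chunk_size: int) -> List[float]:
    """Non-causal 1-D convolution (cross-correlation form) with a chunk mask.

    Each frame sees all past frames covered by the kernel, plus future
    frames only up to the end of its own chunk (assumed masking rule).
    """
    if len(kernel) % 2 == 0:
        raise ValueError("kernel length must be odd")
    if chunk_size < 1:
        raise ValueError("chunk_size must be positive")
    half = len(kernel) // 2
    out = []
    for t in range(len(x)):
        # Index of the first frame that belongs to the next chunk.
        chunk_end = (t // chunk_size + 1) * chunk_size
        acc = 0.0
        for k, w in enumerate(kernel):
            s = t + k - half
            # Skip zero-padded positions and frames beyond the current chunk.
            if 0 <= s < len(x) and s < chunk_end:
                acc += w * x[s]
        out.append(acc)
    return out


def quiet_softmax(scores: List[float]) -> List[float]:
    """Softmax with an extra 1 in the denominator (assumed form of quiet attention).

    Computed in a numerically stable way by shifting with m = max(scores, 0).
    """
    m = max(max(scores), 0.0)
    exps = [math.exp(s - m) for s in scores]
    denom = math.exp(-m) + sum(exps)
    return [e / denom for e in exps]


signal = [1, 2, 3, 4, 5, 6]
streaming = chunk_masked_conv1d(signal, [1, 1, 1], chunk_size=2)
full_context = chunk_masked_conv1d(signal, [1, 1, 1], chunk_size=len(signal))
weights = quiet_softmax([0.0, 0.0])
```

Under these assumptions, varying `chunk_size` between calls (for example, sampling it per training batch) would be what makes the mask dynamic: a small chunk gives streaming behaviour, and a chunk as long as the input recovers an ordinary non-causal convolution. The output for a chunk depends only on frames up to the end of that chunk, so results do not change when later audio arrives.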

Original language	English
Title of host publication	2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	11106-11110
Number of pages	5
ISBN (Electronic)	9798350344851
DOIs	https://doi.org/10.1109/ICASSP48485.2024.10446229
State	Published - 2024
Event	2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Seoul, Korea, Republic of. Duration: 14 Apr 2024 → 19 Apr 2024

Publication series

Name	ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
ISSN (Print)	1520-6149

Conference

Conference	2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024
Country/Territory	Korea, Republic of
City	Seoul
Period	14/04/24 → 19/04/24

Keywords

Conformer
dynamic masked convolution
quiet attention
streaming voice conversion

Access to Document

10.1109/ICASSP48485.2024.10446229

Cite this

Ning, Z., Jiang, Y., Zhu, P., Wang, S., Yao, J., Xie, L., & Bi, M. (2024). DUALVC 2: DYNAMIC MASKED CONVOLUTION FOR UNIFIED STREAMING AND NON-STREAMING VOICE CONVERSION. In 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings (pp. 11106-11110). (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICASSP48485.2024.10446229

Ning, Ziqian ; Jiang, Yuepeng ; Zhu, Pengcheng et al. / DUALVC 2 : DYNAMIC MASKED CONVOLUTION FOR UNIFIED STREAMING AND NON-STREAMING VOICE CONVERSION. 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings. Institute of Electrical and Electronics Engineers Inc., 2024. pp. 11106-11110 (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings).

@inproceedings{1b184531321c438d955d2d081cad6cb1,

title = "DUALVC 2: DYNAMIC MASKED CONVOLUTION FOR UNIFIED STREAMING AND NON-STREAMING VOICE CONVERSION",

abstract = "Voice conversion is becoming increasingly popular, and a growing number of application scenarios require models with streaming inference capabilities. The recently proposed DualVC attempts to achieve this objective through streaming model architecture design and intra-model knowledge distillation along with hybrid predictive coding to compensate for the lack of future information. However, DualVC encounters several problems that limit its performance. First, the autoregressive decoder has error accumulation in its nature and limits the inference speed as well. Second, the causal convolution enables streaming capability but cannot sufficiently use future information within chunks. Third, the model is unable to effectively address the noise in the unvoiced segments, lowering the sound quality. In this paper, we propose DualVC 2 to address these issues. Specifically, the model backbone is migrated to a Conformer-based architecture, empowering parallel inference. Causal convolution is replaced by non-causal convolution with a dynamic chunk mask to make better use of within-chunk future information. Also, quiet attention is introduced to enhance the model's noise robustness. Experiments show that DualVC 2 outperforms DualVC and other baseline systems in both subjective and objective metrics, with only 186.4 ms latency. Our audio samples are made publicly available.",

keywords = "Conformer, dynamic masked convolution, quiet attention, streaming voice conversion",

author = "Ziqian Ning and Yuepeng Jiang and Pengcheng Zhu and Shuai Wang and Jixun Yao and Lei Xie and Mengxiao Bi",

note = "Publisher Copyright: {\textcopyright} 2024 IEEE.; 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 ; Conference date: 14-04-2024 Through 19-04-2024",

year = "2024",

doi = "10.1109/ICASSP48485.2024.10446229",

language = "English",

series = "ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

pages = "11106--11110",

booktitle = "2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings",

}

Ning, Z, Jiang, Y, Zhu, P, Wang, S, Yao, J, Xie, L & Bi, M 2024, DUALVC 2: DYNAMIC MASKED CONVOLUTION FOR UNIFIED STREAMING AND NON-STREAMING VOICE CONVERSION. in 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings. ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, Institute of Electrical and Electronics Engineers Inc., pp. 11106-11110, 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024, Seoul, Korea, Republic of, 14/04/24. https://doi.org/10.1109/ICASSP48485.2024.10446229

DUALVC 2: DYNAMIC MASKED CONVOLUTION FOR UNIFIED STREAMING AND NON-STREAMING VOICE CONVERSION. / Ning, Ziqian; Jiang, Yuepeng; Zhu, Pengcheng et al.
2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings. Institute of Electrical and Electronics Engineers Inc., 2024. p. 11106-11110 (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings).

TY - GEN

T1 - DUALVC 2: DYNAMIC MASKED CONVOLUTION FOR UNIFIED STREAMING AND NON-STREAMING VOICE CONVERSION

T2 - 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024

AU - Ning, Ziqian

AU - Jiang, Yuepeng

AU - Zhu, Pengcheng

AU - Wang, Shuai

AU - Yao, Jixun

AU - Xie, Lei

AU - Bi, Mengxiao

PY - 2024

Y1 - 2024

N2 - Voice conversion is becoming increasingly popular, and a growing number of application scenarios require models with streaming inference capabilities. The recently proposed DualVC attempts to achieve this objective through streaming model architecture design and intra-model knowledge distillation along with hybrid predictive coding to compensate for the lack of future information. However, DualVC encounters several problems that limit its performance. First, the autoregressive decoder has error accumulation in its nature and limits the inference speed as well. Second, the causal convolution enables streaming capability but cannot sufficiently use future information within chunks. Third, the model is unable to effectively address the noise in the unvoiced segments, lowering the sound quality. In this paper, we propose DualVC 2 to address these issues. Specifically, the model backbone is migrated to a Conformer-based architecture, empowering parallel inference. Causal convolution is replaced by non-causal convolution with a dynamic chunk mask to make better use of within-chunk future information. Also, quiet attention is introduced to enhance the model's noise robustness. Experiments show that DualVC 2 outperforms DualVC and other baseline systems in both subjective and objective metrics, with only 186.4 ms latency. Our audio samples are made publicly available.

KW - Conformer

KW - dynamic masked convolution

KW - quiet attention

KW - streaming voice conversion

UR - http://www.scopus.com/inward/record.url?scp=85195392616&partnerID=8YFLogxK

U2 - 10.1109/ICASSP48485.2024.10446229

DO - 10.1109/ICASSP48485.2024.10446229

M3 - Conference contribution

AN - SCOPUS:85195392616

T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings

SP - 11106

EP - 11110

BT - 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings

PB - Institute of Electrical and Electronics Engineers Inc.

Y2 - 14 April 2024 through 19 April 2024

ER -

Ning Z, Jiang Y, Zhu P, Wang S, Yao J, Xie L et al. DUALVC 2: DYNAMIC MASKED CONVOLUTION FOR UNIFIED STREAMING AND NON-STREAMING VOICE CONVERSION. In 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings. Institute of Electrical and Electronics Engineers Inc. 2024. p. 11106-11110. (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings). doi: 10.1109/ICASSP48485.2024.10446229
