On the use of I-vectors and average voice model for voice conversion without parallel data

Jie Wu, Zhizheng Wu, Lei Xie

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

16 Scopus citations

Abstract

Recently, deep and/or recurrent neural networks (DNNs/RNNs) have been employed for voice conversion and have significantly improved the quality of converted speech. However, DNNs/RNNs generally require a large amount of parallel training data (e.g., hundreds of utterances) from the source and target speakers. Collecting such a large amount of data is expensive, and impossible in some applications such as cross-lingual conversion. To solve this problem, we propose to use an average voice model and i-vectors for long short-term memory (LSTM) based voice conversion, which does not require parallel data from the source and target speakers. The average voice model is trained on other speakers' data, and the i-vectors, compact vector representations of the source and target speaker identities, are extracted independently. Subjective evaluation has confirmed the effectiveness of the proposed approach.
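To make the conditioning idea concrete, below is a minimal, hypothetical PyTorch sketch of an LSTM conversion model whose input frames are augmented with a target-speaker i-vector. The class name, layer sizes, and feature dimensions are illustrative assumptions and are not taken from the paper.

```python
import torch
import torch.nn as nn

class IVectorLSTMConverter(nn.Module):
    """Hypothetical sketch: an LSTM mapping source spectral frames to
    target spectral frames, conditioned on a speaker i-vector.
    Dimensions and architecture details are illustrative only."""

    def __init__(self, feat_dim=40, ivec_dim=100, hidden_dim=256, num_layers=2):
        super().__init__()
        # The i-vector is concatenated to every input frame so that a
        # single average voice model can be steered toward a speaker.
        self.lstm = nn.LSTM(feat_dim + ivec_dim, hidden_dim,
                            num_layers=num_layers, batch_first=True)
        self.proj = nn.Linear(hidden_dim, feat_dim)

    def forward(self, frames, ivector):
        # frames:  (batch, time, feat_dim) source spectral features
        # ivector: (batch, ivec_dim) speaker identity vector
        ivec = ivector.unsqueeze(1).expand(-1, frames.size(1), -1)
        out, _ = self.lstm(torch.cat([frames, ivec], dim=-1))
        return self.proj(out)

# Toy usage: convert 100 frames of 40-dim features with a 100-dim i-vector.
model = IVectorLSTMConverter()
converted = model(torch.randn(1, 100, 40), torch.randn(1, 100))
print(converted.shape)  # torch.Size([1, 100, 40])
```

In this reading, the same network serves as the average voice model trained on other speakers' data, and swapping the i-vector at synthesis time selects the target voice without parallel source-target utterances.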

Original language: English
Title of host publication: 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9789881476821
DOIs
State: Published - 17 Jan 2017
Event: 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016 - Jeju, Korea, Republic of
Duration: 13 Dec 2016 - 16 Dec 2016

Publication series

Name: 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016

Conference

Conference: 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016
Country/Territory: Korea, Republic of
City: Jeju
Period: 13/12/16 - 16/12/16

Keywords

  • average voice model
  • i-vector
  • long short-term memory
  • nonparallel training
  • voice conversion
