TY - GEN
T1 - On the use of I-vectors and average voice model for voice conversion without parallel data
AU - Wu, Jie
AU - Wu, Zhizheng
AU - Xie, Lei
N1 - Publisher Copyright:
© 2016 Asia Pacific Signal and Information Processing Association.
PY - 2017/1/17
Y1 - 2017/1/17
N2 - Recently, deep and/or recurrent neural networks (DNNs/RNNs) have been employed for voice conversion and have significantly improved the quality of converted speech. However, DNNs/RNNs generally require a large amount of parallel training data (e.g., hundreds of utterances) from the source and target speakers. Collecting such a large amount of data is expensive, and impossible in some applications, such as cross-lingual conversion. To solve this problem, we propose to use an average voice model and i-vectors for long short-term memory (LSTM) based voice conversion, which does not require parallel data from the source and target speakers. The average voice model is trained on other speakers' data, and the i-vectors, compact vectors representing the identities of the source and target speakers, are extracted independently. Subjective evaluation has confirmed the effectiveness of the proposed approach.
AB - Recently, deep and/or recurrent neural networks (DNNs/RNNs) have been employed for voice conversion and have significantly improved the quality of converted speech. However, DNNs/RNNs generally require a large amount of parallel training data (e.g., hundreds of utterances) from the source and target speakers. Collecting such a large amount of data is expensive, and impossible in some applications, such as cross-lingual conversion. To solve this problem, we propose to use an average voice model and i-vectors for long short-term memory (LSTM) based voice conversion, which does not require parallel data from the source and target speakers. The average voice model is trained on other speakers' data, and the i-vectors, compact vectors representing the identities of the source and target speakers, are extracted independently. Subjective evaluation has confirmed the effectiveness of the proposed approach.
KW - average voice model
KW - i-vector
KW - long short-term memory
KW - nonparallel training
KW - voice conversion
UR - http://www.scopus.com/inward/record.url?scp=85013851128&partnerID=8YFLogxK
U2 - 10.1109/APSIPA.2016.7820901
DO - 10.1109/APSIPA.2016.7820901
M3 - Conference contribution
AN - SCOPUS:85013851128
T3 - 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016
BT - 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016
Y2 - 13 December 2016 through 16 December 2016
ER -