TY - GEN
T1 - On the use of I-vectors and average voice model for voice conversion without parallel data
AU - Wu, Jie
AU - Wu, Zhizheng
AU - Xie, Lei
N1 - Publisher Copyright:
© 2016 Asia Pacific Signal and Information Processing Association.
PY - 2017/1/17
Y1 - 2017/1/17
N2 - Recently, deep and/or recurrent neural networks (DNNs/RNNs) have been employed for voice conversion and have significantly improved the quality of converted speech. However, DNNs/RNNs generally require a large amount of parallel training data (e.g., hundreds of utterances) from the source and target speakers. Collecting such a large amount of data is expensive, and impossible in some applications, such as cross-lingual conversion. To solve this problem, we propose to use an average voice model and i-vectors for long short-term memory (LSTM) based voice conversion, which does not require parallel data from the source and target speakers. The average voice model is trained on other speakers' data, and the i-vectors, compact vectors representing the identities of the source and target speakers, are extracted independently. Subjective evaluation has confirmed the effectiveness of the proposed approach.
AB - Recently, deep and/or recurrent neural networks (DNNs/RNNs) have been employed for voice conversion and have significantly improved the quality of converted speech. However, DNNs/RNNs generally require a large amount of parallel training data (e.g., hundreds of utterances) from the source and target speakers. Collecting such a large amount of data is expensive, and impossible in some applications, such as cross-lingual conversion. To solve this problem, we propose to use an average voice model and i-vectors for long short-term memory (LSTM) based voice conversion, which does not require parallel data from the source and target speakers. The average voice model is trained on other speakers' data, and the i-vectors, compact vectors representing the identities of the source and target speakers, are extracted independently. Subjective evaluation has confirmed the effectiveness of the proposed approach.
KW - average voice model
KW - i-vector
KW - long short-term memory
KW - nonparallel training
KW - voice conversion
UR - http://www.scopus.com/inward/record.url?scp=85013851128&partnerID=8YFLogxK
U2 - 10.1109/APSIPA.2016.7820901
DO - 10.1109/APSIPA.2016.7820901
M3 - Conference contribution
AN - SCOPUS:85013851128
T3 - 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016
BT - 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016
Y2 - 13 December 2016 through 16 December 2016
ER -