On the use of I-vectors and average voice model for voice conversion without parallel data

Jie Wu, Zhizheng Wu, Lei Xie

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

16 Scopus citations

Abstract

Recently, deep and/or recurrent neural networks (DNNs/RNNs) have been employed for voice conversion and have significantly improved the quality of converted speech. However, DNNs/RNNs generally require a large amount of parallel training data (e.g., hundreds of utterances) from the source and target speakers. Collecting such a large amount of data is expensive, and impossible in some applications such as cross-lingual conversion. To solve this problem, we propose to use an average voice model and i-vectors for long short-term memory (LSTM) based voice conversion, which does not require parallel data from the source and target speakers. The average voice model is trained on other speakers' data, and the i-vectors, compact vector representations of the source and target speaker identities, are extracted independently. Subjective evaluation has confirmed the effectiveness of the proposed approach.
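To make the conditioning idea concrete, below is a minimal, hypothetical PyTorch sketch of an LSTM conversion model whose input frames are augmented with a target-speaker i-vector. The class name, layer sizes, and feature dimensions are illustrative assumptions and are not taken from the paper.

```python
import torch
import torch.nn as nn

class IVectorLSTMConverter(nn.Module):
    """Hypothetical sketch: an LSTM mapping source spectral frames to
    target spectral frames, conditioned on a speaker i-vector.
    Dimensions and architecture details are illustrative only."""

    def __init__(self, feat_dim=40, ivec_dim=100, hidden_dim=256, num_layers=2):
        super().__init__()
        # The i-vector is concatenated to every input frame so that a
        # single average voice model can be steered toward a speaker.
        self.lstm = nn.LSTM(feat_dim + ivec_dim, hidden_dim,
                            num_layers=num_layers, batch_first=True)
        self.proj = nn.Linear(hidden_dim, feat_dim)

    def forward(self, frames, ivector):
        # frames:  (batch, time, feat_dim) source spectral features
        # ivector: (batch, ivec_dim) speaker identity vector
        ivec = ivector.unsqueeze(1).expand(-1, frames.size(1), -1)
        out, _ = self.lstm(torch.cat([frames, ivec], dim=-1))
        return self.proj(out)

# Toy usage: convert 100 frames of 40-dim features with a 100-dim i-vector.
model = IVectorLSTMConverter()
converted = model(torch.randn(1, 100, 40), torch.randn(1, 100))
print(converted.shape)  # torch.Size([1, 100, 40])
```

In this reading, the same network serves as the average voice model trained on other speakers' data, and swapping the i-vector at synthesis time selects the target voice without parallel source-target utterances.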

Original language: English
Title of host publication: 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9789881476821
DOIs
State: Published - 17 Jan 2017
Event: 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016 - Jeju, Korea, Republic of
Duration: 13 Dec 2016 - 16 Dec 2016

Publication series

Name: 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016

Conference

Conference: 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016
Country/Territory: Korea, Republic of
City: Jeju
Period: 13/12/16 - 16/12/16

Keywords

  • average voice model
  • i-vector
  • long short-term memory
  • nonparallel training
  • voice conversion
