Pre-alignment guided attention for improving training efficiency and model stability in end-to-end speech synthesis

Xiaolian Zhu; Yuchao Zhang; Shan Yang; Liumeng Xue; Lei Xie

doi:10.1109/ACCESS.2019.2914149


School of Computer Science

Research output: Contribution to journal › Article › Peer-reviewed

26 citations (Scopus)

Abstract

Recently, end-to-end (E2E) neural text-to-speech systems, such as Tacotron2, have begun to surpass the traditional multi-stage hand-engineered systems, with both simplified system building pipelines and high-quality speech. With a unique encoder-decoder neural structure, the Tacotron2 system no longer needs a separately learned text analysis front-end, duration model, acoustic model, and audio synthesis module. The key to such a system lies in the attention mechanism, which learns an alignment between the encoder and the decoder, serving as an implicit duration model bridging the text sequence and the acoustic sequence. However, attention learning suffers from low training efficiency and model instability problems, which hinder the E2E approaches from wide deployment. In this paper, we address these problems and propose a novel pre-alignment guided attention learning approach. Specifically, we inject handy prior knowledge (accurate phoneme durations) into the neural network loss function to bias the attention learning toward the desired direction more accurately. The explicit time alignment between an audio recording and its corresponding phoneme sequence can be achieved by forced alignment from an automatic speech recognizer (ASR). The experiments show that the proposed pre-alignment guided (PAG) attention approach can significantly improve training efficiency and model stability. More specifically, the PAG updated version of Tacotron2 can quickly obtain the attention alignment using only 500 ⟨text, audio⟩ pairs, which is apparently not possible for the original Tacotron2. A series of subjective experiments also show that the PAG-Tacotron2 approach can synthesize more stable and natural speech.
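The abstract does not state the loss term itself, so the following is only a rough sketch of how phoneme durations from forced alignment could be turned into an attention target and added to a training loss. The function names, the hard 0/1 target, the mean-squared-error penalty, and the `weight` parameter are all assumptions made for illustration; they are not taken from the paper.

```python
def prealignment_matrix(durations):
    """Build a hard 0/1 target of shape (frames, phonemes).

    durations[i] is the number of acoustic frames that forced alignment
    assigns to phoneme i (assumed input format).
    """
    target = []
    for phoneme_index, n_frames in enumerate(durations):
        for _ in range(n_frames):
            row = [0.0] * len(durations)
            row[phoneme_index] = 1.0  # this frame belongs to this phoneme
            target.append(row)
    return target


def guided_attention_loss(attention, target):
    """Mean squared difference between attention weights and the target.

    MSE is an assumed choice of penalty, used here only as an example.
    """
    squared = [
        (a - t) ** 2
        for a_row, t_row in zip(attention, target)
        for a, t in zip(a_row, t_row)
    ]
    return sum(squared) / len(squared)


def total_loss(reconstruction_loss, attention, target, weight=1.0):
    """Add the weighted alignment penalty to the usual synthesis loss."""
    return reconstruction_loss + weight * guided_attention_loss(attention, target)


# Hypothetical example: three phonemes lasting 2, 1, and 3 frames.
target = prealignment_matrix([2, 1, 3])
uniform_attention = [[1.0 / 3.0] * 3 for _ in range(6)]
penalty = guided_attention_loss(uniform_attention, target)
```

In this sketch an attention matrix that already matches the forced alignment incurs no penalty, while a diffuse one is pushed toward the diagonal-like target, which is the general idea of biasing attention learning with duration priors.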

Original language	English
Article number	8703406
Pages (from-to)	65955-65964
Number of pages	10
Journal	IEEE Access
Volume	7
DOI	https://doi.org/10.1109/ACCESS.2019.2914149
Publication status	Published - 2019

Cite this

@article{1779294da9c84037b6a00c866473347a,

title = "Pre-alignment guided attention for improving training efficiency and model stability in end-to-end speech synthesis",

abstract = "Recently, end-to-end (E2E) neural text-to-speech systems, such as Tacotron2, have begun to surpass the traditional multi-stage hand-engineered systems, with both simplified system building pipelines and high-quality speech. With a unique encoder-decoder neural structure, the Tacotron2 system no longer needs a separately learned text analysis front-end, duration model, acoustic model, and audio synthesis module. The key to such a system lies in the attention mechanism, which learns an alignment between the encoder and the decoder, serving as an implicit duration model bridging the text sequence and the acoustic sequence. However, attention learning suffers from low training efficiency and model instability problems, which hinder the E2E approaches from wide deployment. In this paper, we address these problems and propose a novel pre-alignment guided attention learning approach. Specifically, we inject handy prior knowledge (accurate phoneme durations) into the neural network loss function to bias the attention learning toward the desired direction more accurately. The explicit time alignment between an audio recording and its corresponding phoneme sequence can be achieved by forced alignment from an automatic speech recognizer (ASR). The experiments show that the proposed pre-alignment guided (PAG) attention approach can significantly improve training efficiency and model stability. More specifically, the PAG updated version of Tacotron2 can quickly obtain the attention alignment using only 500~$\langle$text, audio$\rangle$ pairs, which is apparently not possible for the original Tacotron2. A series of subjective experiments also show that the PAG-Tacotron2 approach can synthesize more stable and natural speech.",

keywords = "alignment loss, Attention, model stability, speech synthesis, training efficiency",

author = "Xiaolian Zhu and Yuchao Zhang and Shan Yang and Liumeng Xue and Lei Xie",

note = "Publisher Copyright: {\textcopyright} 2013 IEEE.",

year = "2019",

doi = "10.1109/ACCESS.2019.2914149",

language = "English",

volume = "7",

pages = "65955--65964",

journal = "IEEE Access",

issn = "2169-3536",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - JOUR

T1 - Pre-alignment guided attention for improving training efficiency and model stability in end-to-end speech synthesis

AU - Zhu, Xiaolian

AU - Zhang, Yuchao

AU - Yang, Shan

AU - Xue, Liumeng

AU - Xie, Lei

PY - 2019

Y1 - 2019

N2 - Recently, end-to-end (E2E) neural text-to-speech systems, such as Tacotron2, have begun to surpass the traditional multi-stage hand-engineered systems, with both simplified system building pipelines and high-quality speech. With a unique encoder-decoder neural structure, the Tacotron2 system no longer needs a separately learned text analysis front-end, duration model, acoustic model, and audio synthesis module. The key to such a system lies in the attention mechanism, which learns an alignment between the encoder and the decoder, serving as an implicit duration model bridging the text sequence and the acoustic sequence. However, attention learning suffers from low training efficiency and model instability problems, which hinder the E2E approaches from wide deployment. In this paper, we address these problems and propose a novel pre-alignment guided attention learning approach. Specifically, we inject handy prior knowledge (accurate phoneme durations) into the neural network loss function to bias the attention learning toward the desired direction more accurately. The explicit time alignment between an audio recording and its corresponding phoneme sequence can be achieved by forced alignment from an automatic speech recognizer (ASR). The experiments show that the proposed pre-alignment guided (PAG) attention approach can significantly improve training efficiency and model stability. More specifically, the PAG updated version of Tacotron2 can quickly obtain the attention alignment using only 500 ⟨text, audio⟩ pairs, which is apparently not possible for the original Tacotron2. A series of subjective experiments also show that the PAG-Tacotron2 approach can synthesize more stable and natural speech.

AB - Recently, end-to-end (E2E) neural text-to-speech systems, such as Tacotron2, have begun to surpass the traditional multi-stage hand-engineered systems, with both simplified system building pipelines and high-quality speech. With a unique encoder-decoder neural structure, the Tacotron2 system no longer needs a separately learned text analysis front-end, duration model, acoustic model, and audio synthesis module. The key to such a system lies in the attention mechanism, which learns an alignment between the encoder and the decoder, serving as an implicit duration model bridging the text sequence and the acoustic sequence. However, attention learning suffers from low training efficiency and model instability problems, which hinder the E2E approaches from wide deployment. In this paper, we address these problems and propose a novel pre-alignment guided attention learning approach. Specifically, we inject handy prior knowledge (accurate phoneme durations) into the neural network loss function to bias the attention learning toward the desired direction more accurately. The explicit time alignment between an audio recording and its corresponding phoneme sequence can be achieved by forced alignment from an automatic speech recognizer (ASR). The experiments show that the proposed pre-alignment guided (PAG) attention approach can significantly improve training efficiency and model stability. More specifically, the PAG updated version of Tacotron2 can quickly obtain the attention alignment using only 500 ⟨text, audio⟩ pairs, which is apparently not possible for the original Tacotron2. A series of subjective experiments also show that the PAG-Tacotron2 approach can synthesize more stable and natural speech.

KW - alignment loss

KW - Attention

KW - model stability

KW - speech synthesis

KW - training efficiency

UR - http://www.scopus.com/inward/record.url?scp=85067292156&partnerID=8YFLogxK

U2 - 10.1109/ACCESS.2019.2914149

DO - 10.1109/ACCESS.2019.2914149

M3 - Article

AN - SCOPUS:85067292156

SN - 2169-3536

VL - 7

SP - 65955

EP - 65964

JO - IEEE Access

JF - IEEE Access

M1 - 8703406

ER -
