Optimizing the Perceptual Quality of Time-Domain Speech Enhancement with Reinforcement Learning

Xiang Hao; Chenglin Xu; Lei Xie; Haizhou Li

doi:10.26599/TST.2021.9010048

School of Computer Science

Research output: Contribution to journal › Article › peer-review

8 Citations (Scopus)

Abstract

In neural speech enhancement, a mismatch exists between the training objective, i.e., Mean-Square Error (MSE), and perceptual quality evaluation metrics, i.e., perceptual evaluation of speech quality and short-time objective intelligibility. We propose a novel reinforcement learning algorithm and network architecture, which incorporate a non-differentiable perceptual quality evaluation metric into the objective function using a dynamic filter module. Unlike the traditional dynamic filter implementation that directly generates a convolution kernel, we use a filter generation agent to predict the probability density function of a multivariate Gaussian distribution, from which we sample the convolution kernel. Experimental results show that the proposed reinforcement learning method clearly improves the perceptual quality over other supervised learning methods with the MSE objective function.
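The sampling idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the diagonal covariance, the fixed `mean` and `log_std` vectors (standing in for the output of the filter generation agent), the `placeholder_reward` function (standing in for a non-differentiable metric such as PESQ), and the score-function (REINFORCE-style) gradient estimate are all assumptions made for illustration.

```python
import math
import random

def sample_kernel(mean, log_std, rng):
    """Draw a convolution kernel from a diagonal Gaussian (assumed form)."""
    return [rng.gauss(m, math.exp(s)) for m, s in zip(mean, log_std)]

def log_prob(kernel, mean, log_std):
    """Log-density of the sampled kernel under the diagonal Gaussian."""
    total = 0.0
    for k, m, s in zip(kernel, mean, log_std):
        var = math.exp(2.0 * s)
        total += -0.5 * ((k - m) ** 2 / var + 2.0 * s + math.log(2.0 * math.pi))
    return total

def conv1d_same(signal, kernel):
    """Zero-padded 1-D correlation with output length equal to the input."""
    half = len(kernel) // 2
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = t + j - half
            if 0 <= idx < len(signal):
                acc += w * signal[idx]
        out.append(acc)
    return out

def placeholder_reward(enhanced, clean):
    """Hypothetical stand-in for a perceptual metric; here negative MSE."""
    mse = sum((e - c) ** 2 for e, c in zip(enhanced, clean)) / len(clean)
    return -mse

def reinforce_grad_mean(kernel, mean, log_std, reward, baseline=0.0):
    """Score-function estimate of d E[reward] / d mean for one sample."""
    adv = reward - baseline
    return [adv * (k - m) / math.exp(2.0 * s)
            for k, m, s in zip(kernel, mean, log_std)]

# Toy usage: one sampled kernel, one reward, one gradient estimate.
rng = random.Random(0)
clean = [math.sin(0.3 * t) for t in range(64)]
noisy = [c + rng.gauss(0.0, 0.3) for c in clean]
mean, log_std = [0.0, 1.0, 0.0], [-2.0, -2.0, -2.0]
kernel = sample_kernel(mean, log_std, rng)
reward = placeholder_reward(conv1d_same(noisy, kernel), clean)
grad = reinforce_grad_mean(kernel, mean, log_std, reward)
```

Because the reward only enters as a scalar weight on the gradient of the log-density, it never needs to be differentiated, which is why a metric with no gradient can still drive the update in a scheme of this kind.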

Original language	English
Pages (from-to)	939-947
Number of pages	9
Journal	Tsinghua Science and Technology
Volume	27
Issue number	6
DOI	https://doi.org/10.26599/TST.2021.9010048
Publication status	Published - 1 Dec 2022

Cite this

@article{ab5bb63e93df44c9b715e0c36a942c66,

title = "Optimizing the Perceptual Quality of Time-Domain Speech Enhancement with Reinforcement Learning",

abstract = "In neural speech enhancement, a mismatch exists between the training objective, i.e., Mean-Square Error (MSE), and perceptual quality evaluation metrics, i.e., perceptual evaluation of speech quality and short-time objective intelligibility. We propose a novel reinforcement learning algorithm and network architecture, which incorporate a non-differentiable perceptual quality evaluation metric into the objective function using a dynamic filter module. Unlike the traditional dynamic filter implementation that directly generates a convolution kernel, we use a filter generation agent to predict the probability density function of a multivariate Gaussian distribution, from which we sample the convolution kernel. Experimental results show that the proposed reinforcement learning method clearly improves the perceptual quality over other supervised learning methods with the MSE objective function.",

keywords = "dynamic filter, neural networks, reinforcement learning, speech enhancement",

author = "Xiang Hao and Chenglin Xu and Lei Xie and Haizhou Li",

note = "Publisher Copyright: {\textcopyright} 1996-2012 Tsinghua University Press.",

year = "2022",

month = dec,

day = "1",

doi = "10.26599/TST.2021.9010048",

language = "English",

volume = "27",

pages = "939--947",

journal = "Tsinghua Science and Technology",

issn = "1007-0214",

publisher = "Tsinghua University",

number = "6",

}

TY - JOUR

T1 - Optimizing the Perceptual Quality of Time-Domain Speech Enhancement with Reinforcement Learning

AU - Hao, Xiang

AU - Xu, Chenglin

AU - Xie, Lei

AU - Li, Haizhou

PY - 2022/12/1

Y1 - 2022/12/1

N2 - In neural speech enhancement, a mismatch exists between the training objective, i.e., Mean-Square Error (MSE), and perceptual quality evaluation metrics, i.e., perceptual evaluation of speech quality and short-time objective intelligibility. We propose a novel reinforcement learning algorithm and network architecture, which incorporate a non-differentiable perceptual quality evaluation metric into the objective function using a dynamic filter module. Unlike the traditional dynamic filter implementation that directly generates a convolution kernel, we use a filter generation agent to predict the probability density function of a multivariate Gaussian distribution, from which we sample the convolution kernel. Experimental results show that the proposed reinforcement learning method clearly improves the perceptual quality over other supervised learning methods with the MSE objective function.

AB - In neural speech enhancement, a mismatch exists between the training objective, i.e., Mean-Square Error (MSE), and perceptual quality evaluation metrics, i.e., perceptual evaluation of speech quality and short-time objective intelligibility. We propose a novel reinforcement learning algorithm and network architecture, which incorporate a non-differentiable perceptual quality evaluation metric into the objective function using a dynamic filter module. Unlike the traditional dynamic filter implementation that directly generates a convolution kernel, we use a filter generation agent to predict the probability density function of a multivariate Gaussian distribution, from which we sample the convolution kernel. Experimental results show that the proposed reinforcement learning method clearly improves the perceptual quality over other supervised learning methods with the MSE objective function.

KW - dynamic filter

KW - neural networks

KW - reinforcement learning

KW - speech enhancement

UR - http://www.scopus.com/inward/record.url?scp=85133706189&partnerID=8YFLogxK

U2 - 10.26599/TST.2021.9010048

DO - 10.26599/TST.2021.9010048

M3 - Article

AN - SCOPUS:85133706189

SN - 1007-0214

VL - 27

SP - 939

EP - 947

JO - Tsinghua Science and Technology

JF - Tsinghua Science and Technology

IS - 6

ER -
