Optimizing the Perceptual Quality of Time-Domain Speech Enhancement with Reinforcement Learning

Xiang Hao; Chenglin Xu; Lei Xie; Haizhou Li

doi:10.26599/TST.2021.9010048

School of Computer Science

Research output: Contribution to journal › Article › peer-review

8 Scopus citations

Abstract

In neural speech enhancement, a mismatch exists between the training objective, i.e., Mean-Square Error (MSE), and perceptual quality evaluation metrics, i.e., perceptual evaluation of speech quality and short-time objective intelligibility. We propose a novel reinforcement learning algorithm and network architecture, which incorporate a non-differentiable perceptual quality evaluation metric into the objective function using a dynamic filter module. Unlike the traditional dynamic filter implementation that directly generates a convolution kernel, we use a filter generation agent to predict the probability density function of a multivariate Gaussian distribution, from which we sample the convolution kernel. Experimental results show that the proposed reinforcement learning method clearly improves the perceptual quality over other supervised learning methods with the MSE objective function.
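The core idea in the abstract, sampling a convolution kernel from a predicted Gaussian and scoring the result with a non-differentiable metric, can be illustrated with a score-function (REINFORCE) update. The following is a minimal sketch under stated assumptions, not the authors' implementation: the Gaussian is taken to be diagonal, the "agent" is reduced to a mean vector and a fixed log standard deviation rather than a neural network, and `toy_reward` is a hypothetical stand-in for a metric such as PESQ or STOI. All function names and parameters are illustrative.

```python
import numpy as np


def sample_kernel(mean, log_std, rng):
    """Draw a kernel from a diagonal Gaussian and return it with its log-density."""
    std = np.exp(log_std)
    kernel = mean + std * rng.standard_normal(mean.shape)
    log_prob = -0.5 * np.sum(
        ((kernel - mean) / std) ** 2 + 2.0 * log_std + np.log(2.0 * np.pi)
    )
    return kernel, log_prob


def toy_reward(enhanced, clean):
    """Hypothetical stand-in for a non-differentiable quality metric.

    The SNR in dB is rounded, so the reward is piecewise constant and
    provides no useful gradient with respect to the kernel.
    """
    noise_energy = np.sum((clean - enhanced) ** 2) + 1e-12
    snr_db = 10.0 * np.log10(np.sum(clean ** 2) / noise_energy)
    return float(np.round(snr_db, 1))


def reinforce_step(mean, log_std, noisy, clean, rng, lr=1e-3, n_samples=16):
    """One REINFORCE update of the Gaussian mean; log_std is held fixed."""
    std = np.exp(log_std)
    kernels, rewards = [], []
    for _ in range(n_samples):
        kernel, _ = sample_kernel(mean, log_std, rng)
        # Filter the noisy signal with the sampled kernel (time domain).
        enhanced = np.convolve(noisy, kernel, mode="same")
        kernels.append(kernel)
        rewards.append(toy_reward(enhanced, clean))
    rewards = np.array(rewards)
    baseline = rewards.mean()  # simple variance-reduction baseline
    grad = np.zeros_like(mean)
    for kernel, reward in zip(kernels, rewards):
        # Gradient of log N(kernel; mean, std^2) w.r.t. mean is (kernel - mean) / std^2.
        grad += (reward - baseline) * (kernel - mean) / std ** 2
    grad /= n_samples
    # Gradient ascent on the expected reward.
    return mean + lr * grad, baseline
```

A caller would typically initialize `mean` to an identity-like kernel (a single 1 at the center), pick a small `log_std`, and call `reinforce_step` repeatedly with a seeded `np.random.default_rng`. Because the reward enters only as a scalar weight on the log-density gradient, the metric itself never needs to be differentiated.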

Original language	English
Pages (from-to)	939-947
Number of pages	9
Journal	Tsinghua Science and Technology
Volume	27
Issue number	6
DOIs	https://doi.org/10.26599/TST.2021.9010048
State	Published - 1 Dec 2022

Keywords

dynamic filter
neural networks
reinforcement learning
speech enhancement

Cite this

@article{ab5bb63e93df44c9b715e0c36a942c66,

title = "Optimizing the Perceptual Quality of Time-Domain Speech Enhancement with Reinforcement Learning",

abstract = "In neural speech enhancement, a mismatch exists between the training objective, i.e., Mean-Square Error (MSE), and perceptual quality evaluation metrics, i.e., perceptual evaluation of speech quality and short-time objective intelligibility. We propose a novel reinforcement learning algorithm and network architecture, which incorporate a non-differentiable perceptual quality evaluation metric into the objective function using a dynamic filter module. Unlike the traditional dynamic filter implementation that directly generates a convolution kernel, we use a filter generation agent to predict the probability density function of a multivariate Gaussian distribution, from which we sample the convolution kernel. Experimental results show that the proposed reinforcement learning method clearly improves the perceptual quality over other supervised learning methods with the MSE objective function.",

keywords = "dynamic filter, neural networks, reinforcement learning, speech enhancement",

author = "Xiang Hao and Chenglin Xu and Lei Xie and Haizhou Li",

note = "Publisher Copyright: {\textcopyright} 1996-2012 Tsinghua University Press.",

year = "2022",

month = dec,

day = "1",

doi = "10.26599/TST.2021.9010048",

language = "English",

volume = "27",

pages = "939--947",

journal = "Tsinghua Science and Technology",

issn = "1007-0214",

publisher = "Tsinghua University",

number = "6",

}

TY - JOUR

T1 - Optimizing the Perceptual Quality of Time-Domain Speech Enhancement with Reinforcement Learning

AU - Hao, Xiang

AU - Xu, Chenglin

AU - Xie, Lei

AU - Li, Haizhou

PY - 2022/12/1

Y1 - 2022/12/1

N2 - In neural speech enhancement, a mismatch exists between the training objective, i.e., Mean-Square Error (MSE), and perceptual quality evaluation metrics, i.e., perceptual evaluation of speech quality and short-time objective intelligibility. We propose a novel reinforcement learning algorithm and network architecture, which incorporate a non-differentiable perceptual quality evaluation metric into the objective function using a dynamic filter module. Unlike the traditional dynamic filter implementation that directly generates a convolution kernel, we use a filter generation agent to predict the probability density function of a multivariate Gaussian distribution, from which we sample the convolution kernel. Experimental results show that the proposed reinforcement learning method clearly improves the perceptual quality over other supervised learning methods with the MSE objective function.

KW - dynamic filter

KW - neural networks

KW - reinforcement learning

KW - speech enhancement

UR - http://www.scopus.com/inward/record.url?scp=85133706189&partnerID=8YFLogxK

U2 - 10.26599/TST.2021.9010048

DO - 10.26599/TST.2021.9010048

M3 - Article

AN - SCOPUS:85133706189

SN - 1007-0214

VL - 27

SP - 939

EP - 947

JO - Tsinghua Science and Technology

JF - Tsinghua Science and Technology

IS - 6

ER -
