Abstract
Selective listening to different sound sources in complex acoustic scenes remains a pressing challenge for machine hearing. In this study, the separation of two sound sources from monaural mixtures is investigated. An end-to-end fully convolutional time-domain audio separation network (ConvTasNet) is trained on a universal dataset that includes speech, environmental sounds, and music. With the best-performing network, mixtures in the test dataset achieve an average scale-invariant signal-to-distortion ratio improvement (SI-SDRi) of 11.70 dB, which is comparable to human performance in separating natural sources. Beyond this promising performance, the main contribution of our study is to reveal the underlying separation mechanisms of the network through a series of classical human auditory segregation experiments. Results show that, without any biological modeling of the auditory system, the proposed network spontaneously mimics aspects of the auditory system to separate sources. Not only are the frequency-proximity and harmonicity principles of auditory scene analysis spontaneously learned by such a purely statistical deep network, but the frequency selectivity at high and low frequencies and the resolvability of harmonics are also precisely simulated. The emergence of deep networks with behavioral characteristics similar to those of human listeners opens the possibility of developing a universal network that can adapt to all scenes and achieve selective listening like the human ear. It also provides a new perspective on modeling the auditory system for other problems such as recognition and localization.
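For reference, the SI-SDRi figure quoted above is the standard scale-invariant SDR improvement commonly used to evaluate source separation: the SI-SDR of the separated estimate against the clean reference, minus the SI-SDR of the unprocessed mixture against the same reference. The following NumPy sketch illustrates how such a score could be computed; the function names and the mean-removal step are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def si_sdr(estimate: np.ndarray, reference: np.ndarray) -> float:
    """Scale-invariant signal-to-distortion ratio (SI-SDR) in dB."""
    # Remove DC offsets so the optimal scaling depends only on signal energy.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to find the scaled target.
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference
    noise = estimate - target
    return 10.0 * np.log10(np.sum(target ** 2) / np.sum(noise ** 2))

def si_sdr_improvement(estimate: np.ndarray,
                       reference: np.ndarray,
                       mixture: np.ndarray) -> float:
    """SI-SDRi: gain of the separated estimate over the raw mixture."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)
```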
Original language | English |
---|---|
Article number | 108591 |
Journal | Applied Acoustics |
Volume | 188 |
DOI | |
Publication status | Published - Jan 2022 |