Human–machine collaboration based sound event detection

Shengtong Ge; Zhiwen Yu; Fan Yang; Jiaqi Liu; Liang Wang

doi:10.1007/s42486-022-00091-9

School of Computer Science

Northwestern Polytechnical University Xian

Research output: Contribution to journal › Article › peer-review

1 Citation (Scopus)

Abstract

Sound Event Detection (SED) is the task of detecting and demarcating segments with specific semantics in an audio recording. It has promising applications in security monitoring, intelligent medical treatment, industrial production, and other areas. However, SED is still at an early stage of development and faces many challenges, including a lack of accurately annotated data and poor detection performance caused by overlapping sound events. To address these problems, and drawing on the intelligence, flexibility, and adaptability that humans show when facing complex problems and changing environments, this paper proposes a human–machine collaboration based SED approach (HMSED). First, to reduce the cost of labeling data, we employ two CNN models with an embedding-level attention pooling module for weakly-labeled SED. Second, to improve the abilities of these two models alternately, we propose an end-to-end guided learning process for semi-supervised learning. Third, we apply a group of median filters with adaptive window sizes to post-process the model's output probabilities. Fourth, the model is adjusted and optimized by combining machine recognition results with manual annotation feedback. An interactive annotation interface for HMSED is developed using HTML and JavaScript. We conduct extensive exploratory experiments on the effects of human workload, model structure, hyperparameters, and adaptive post-processing. The results show that HMSED is superior to some classical SED approaches.
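
The post-processing step named in the abstract (median filters with adaptive window size) can be illustrated with a minimal Python sketch. The abstract does not say how the window sizes are chosen, so everything specific here is an assumption for illustration only: the window is taken as a fixed fraction (`beta`) of each class's average event length in frames, and the names `adaptive_window`, `median_filter`, `postprocess`, `beta`, and `threshold` are hypothetical, not the authors' implementation.

```python
from statistics import median


def adaptive_window(avg_event_frames, beta=1 / 3, min_window=1):
    """Pick an odd window size proportional to a class's average event length.

    ASSUMPTION: the proportional rule and beta=1/3 are illustrative choices,
    not values taken from the paper.
    """
    window = max(min_window, int(round(avg_event_frames * beta)))
    return window if window % 2 == 1 else window + 1


def median_filter(seq, window):
    """1-D median filter; edges are padded by repeating the boundary values."""
    if not seq:
        return []
    if window % 2 == 0:
        window += 1  # a median filter needs a centred (odd) window
    half = window // 2
    padded = [seq[0]] * half + list(seq) + [seq[-1]] * half
    return [median(padded[i:i + window]) for i in range(len(seq))]


def postprocess(frame_probs, avg_event_frames, threshold=0.5):
    """Threshold per-class frame probabilities, then smooth each class
    with its own window size (hypothetical helper)."""
    smoothed = {}
    for label, probs in frame_probs.items():
        binary = [1 if p >= threshold else 0 for p in probs]
        window = adaptive_window(avg_event_frames[label])
        smoothed[label] = median_filter(binary, window)
    return smoothed


# Example: an isolated one-frame detection is removed, a longer run is kept.
example = postprocess(
    {"dog_bark": [0.1, 0.9, 0.1, 0.1, 0.8, 0.9, 0.7, 0.1]},
    {"dog_bark": 9},
)
```

The point of a per-class window is that short events (for example, a dog bark) and long events (for example, a vacuum cleaner) should not be smoothed equally: a window suited to long events would erase short ones, and a window suited to short events would leave long ones fragmented.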

Original language	English
Pages (from-to)	158-171
Number of pages	14
Journal	CCF Transactions on Pervasive Computing and Interaction
Volume	4
Issue number	2
DOI	https://doi.org/10.1007/s42486-022-00091-9
Publication status	Published - Jun 2022

Cite this

@article{f169fcf4db5b4f1fa49247a34d399c7d,

title = "Human–machine collaboration based sound event detection",

abstract = "Sound Event Detection (SED) is the task of detecting and demarcating the segments with specific semantics in audio recording. It has a promising application prospect in security monitoring, intelligent medical treatment, industrial production and so on. However, SED is still in the early stage of development and it faces many challenges, including the lack of accurately annotated data and the poor performance on detection due to the overlap of sound events. In view of the above problems, considering the intelligence of human beings and their flexibility and adaptability in the face of complex problems and changing environment, this paper proposes an approach of human–machine collaboration based SED (HMSED). In order to reduce the cost of labeling data, we first employ two CNN models with embedding-level attention pool module for weakly-labeled SED. Second, in order to improve the abilities of these two models alternately, we propose an end-to-end guided learning process for semi-supervised learning. Third, we use a group of median filters with adaptive window size in the post-processing of output probabilities of the model. Fourth, the model is adjusted and optimized by combining the results of machine recognition and manual annotation feedback. Based on HTML and JavaScript, an interactive annotation interface for HMSED is developed. And we do extensive exploratory experiments on the effects of human workload, model structure, hyperparameter and adaptive post-processing. The result shows that the HMSED is superior to some classical SED approaches.",

keywords = "Deep learning, Human–machine collaboration, Semi-supervised learning, Sound event detection",

author = "Shengtong Ge and Zhiwen Yu and Fan Yang and Jiaqi Liu and Liang Wang",

note = "Publisher Copyright: {\textcopyright} 2022, China Computer Federation (CCF).",

year = "2022",

month = jun,

doi = "10.1007/s42486-022-00091-9",

language = "English",

volume = "4",

pages = "158--171",

journal = "CCF Transactions on Pervasive Computing and Interaction",

issn = "2524-521X",

publisher = "Springer Verlag",

number = "2",

}

TY - JOUR

T1 - Human–machine collaboration based sound event detection

AU - Ge, Shengtong

AU - Yu, Zhiwen

AU - Yang, Fan

AU - Liu, Jiaqi

AU - Wang, Liang

PY - 2022/6

Y1 - 2022/6

AB - Sound Event Detection (SED) is the task of detecting and demarcating the segments with specific semantics in audio recording. It has a promising application prospect in security monitoring, intelligent medical treatment, industrial production and so on. However, SED is still in the early stage of development and it faces many challenges, including the lack of accurately annotated data and the poor performance on detection due to the overlap of sound events. In view of the above problems, considering the intelligence of human beings and their flexibility and adaptability in the face of complex problems and changing environment, this paper proposes an approach of human–machine collaboration based SED (HMSED). In order to reduce the cost of labeling data, we first employ two CNN models with embedding-level attention pool module for weakly-labeled SED. Second, in order to improve the abilities of these two models alternately, we propose an end-to-end guided learning process for semi-supervised learning. Third, we use a group of median filters with adaptive window size in the post-processing of output probabilities of the model. Fourth, the model is adjusted and optimized by combining the results of machine recognition and manual annotation feedback. Based on HTML and JavaScript, an interactive annotation interface for HMSED is developed. And we do extensive exploratory experiments on the effects of human workload, model structure, hyperparameter and adaptive post-processing. The result shows that the HMSED is superior to some classical SED approaches.

KW - Deep learning

KW - Human–machine collaboration

KW - Semi-supervised learning

KW - Sound event detection

UR - http://www.scopus.com/inward/record.url?scp=85124755978&partnerID=8YFLogxK

U2 - 10.1007/s42486-022-00091-9

DO - 10.1007/s42486-022-00091-9

M3 - Article

AN - SCOPUS:85124755978

SN - 2524-521X

VL - 4

SP - 158

EP - 171

JO - CCF Transactions on Pervasive Computing and Interaction

JF - CCF Transactions on Pervasive Computing and Interaction

IS - 2

ER -
