TY - JOUR
T1 - Human–machine collaboration based sound event detection
AU - Ge, Shengtong
AU - Yu, Zhiwen
AU - Yang, Fan
AU - Liu, Jiaqi
AU - Wang, Liang
N1 - Publisher Copyright: © 2022, China Computer Federation (CCF).
PY - 2022/6
Y1 - 2022/6
AB - Sound Event Detection (SED) is the task of detecting and demarcating segments with specific semantics in audio recordings. It has promising applications in security monitoring, intelligent medical treatment, industrial production, and other areas. However, SED is still at an early stage of development and faces many challenges, including a lack of accurately annotated data and poor detection performance caused by overlapping sound events. To address these problems, and drawing on human intelligence, flexibility, and adaptability in the face of complex problems and changing environments, this paper proposes a human–machine collaboration based SED approach (HMSED). First, to reduce the cost of labeling data, we employ two CNN models with embedding-level attention pooling modules for weakly labeled SED. Second, to improve the two models alternately, we propose an end-to-end guided learning process for semi-supervised learning. Third, we post-process the model's output probabilities with a group of median filters with adaptive window sizes. Fourth, the model is adjusted and optimized by combining machine recognition results with manual annotation feedback. An interactive annotation interface for HMSED is developed using HTML and JavaScript. We conduct extensive exploratory experiments on the effects of human workload, model structure, hyperparameters, and adaptive post-processing. The results show that HMSED outperforms some classical SED approaches.
KW - Deep learning
KW - Human–machine collaboration
KW - Semi-supervised learning
KW - Sound event detection
UR - http://www.scopus.com/inward/record.url?scp=85124755978&partnerID=8YFLogxK
U2 - 10.1007/s42486-022-00091-9
DO - 10.1007/s42486-022-00091-9
M3 - Article
AN - SCOPUS:85124755978
SN - 2524-521X
VL - 4
SP - 158
EP - 171
JO - CCF Transactions on Pervasive Computing and Interaction
JF - CCF Transactions on Pervasive Computing and Interaction
IS - 2
ER -