Joint Training or Not: An Exploration of Pre-trained Speech Models in Audio-Visual Speaker Diarization

Huan Zhao, Li Zhang, Yue Li, Yannan Wang, Hongji Wang, Wei Rao, Qing Wang, Lei Xie

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-reviewed

Abstract

The scarcity of labeled audio-visual datasets constrains the training of strong audio-visual speaker diarization (AVSD) systems. To improve AVSD performance, we leverage pre-trained supervised and self-supervised speech models. Specifically, we adopt supervised (ResNet and ECAPA-TDNN) and self-supervised (WavLM and HuBERT) pre-trained models as the speaker and audio embedding extractors in an end-to-end AVSD system. We then explore the effectiveness of different frameworks, including Transformer, Conformer, and a cross-attention mechanism, in the audio-visual decoder. To mitigate the performance degradation caused by separate training, we jointly train the audio encoder, speaker encoder, and audio-visual decoder of the AVSD system. Experiments on the MISP dataset demonstrate that the proposed method achieves superior performance; it placed third in the MISP Challenge 2022.
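As a rough illustration of the cross-attention mechanism mentioned for the audio-visual decoder, the following is a minimal, standard-library-only sketch of single-head scaled dot-product cross-attention. It is not the authors' implementation: the function names, the toy embeddings, the absence of learned projections, and the choice to let audio frames query video frames are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention (illustrative only).

    queries: one vector per audio frame (hypothetical audio embeddings)
    keys, values: one vector per video frame (hypothetical visual embeddings)
    Returns one fused vector per audio frame and the attention weights.
    """
    dim = len(queries[0])
    fused, weights = [], []
    for q in queries:
        # Similarity of this audio frame to every video frame.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        w = softmax(scores)
        # Weighted sum of the video-frame value vectors.
        fused.append(
            [sum(wi * v[j] for wi, v in zip(w, values)) for j in range(len(values[0]))]
        )
        weights.append(w)
    return fused, weights


# Toy inputs (assumed): 3 audio frames, 2 video frames, 2-dimensional embeddings.
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
video = [[2.0, 0.0], [0.0, 2.0]]
fused, weights = cross_attention(audio, video, video)
```

Each row of `weights` sums to 1 and says how strongly one audio frame attends to each video frame. In a jointly trained system, the gradients from the decoder loss would also flow back into the encoders that produce these embeddings, rather than stopping at frozen features.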

Original language: English
Title of host publication: Man-Machine Speech Communication - 18th National Conference, NCMMSC 2023, Proceedings
Editors: Jia Jia, Zhenhua Ling, Xie Chen, Ya Li, Zixing Zhang
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 265-275
Number of pages: 11
ISBN (Print): 9789819706006
Publication status: Published - 2024
Event: 18th National Conference on Man-Machine Speech Communication, NCMMSC 2023 - Suzhou, China
Duration: 8 Dec 2023 to 11 Dec 2023

Publication series

Name: Communications in Computer and Information Science
Volume: 2006
ISSN (Print): 1865-0929
ISSN (Electronic): 1865-0937

Conference

Conference: 18th National Conference on Man-Machine Speech Communication, NCMMSC 2023
Country/Territory: China
City: Suzhou
Period: 8/12/23 to 11/12/23
