Joint Training or Not: An Exploration of Pre-trained Speech Models in Audio-Visual Speaker Diarization

  • Huan Zhao
  • Li Zhang
  • Yue Li
  • Yannan Wang
  • Hongji Wang
  • Wei Rao
  • Qing Wang
  • Lei Xie
  • Northwestern Polytechnical University Xian
  • Tencent

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

Abstract

The scarcity of labeled audio-visual datasets constrains the training of strong audio-visual speaker diarization systems. To improve performance on this task, we leverage pre-trained supervised and self-supervised speech models. Specifically, we adopt supervised models (ResNet and ECAPA-TDNN) and self-supervised pre-trained models (WavLM and HuBERT) as the speaker and audio embedding extractors in an end-to-end audio-visual speaker diarization (AVSD) system. We then explore the effectiveness of different frameworks in the audio-visual decoder, including Transformer, Conformer, and a cross-attention mechanism. To mitigate the performance degradation caused by separate training, we jointly train the audio encoder, speaker encoder, and audio-visual decoder in the AVSD system. Experiments on the MISP dataset demonstrate that the proposed method achieves superior performance, and the system obtained third place in the MISP Challenge 2022.

Original language: English
Title of host publication: Man-Machine Speech Communication - 18th National Conference, NCMMSC 2023, Proceedings
Editors: Jia Jia, Zhenhua Ling, Xie Chen, Ya Li, Zixing Zhang
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 265-275
Number of pages: 11
ISBN (Print): 9789819706006
DOIs
State: Published - 2024
Event: 18th National Conference on Man-Machine Speech Communication, NCMMSC 2023 - Suzhou, China
Duration: 8 Dec 2023 - 11 Dec 2023

Publication series

Name: Communications in Computer and Information Science
Volume: 2006
ISSN (Print): 1865-0929
ISSN (Electronic): 1865-0937

Conference

Conference: 18th National Conference on Man-Machine Speech Communication, NCMMSC 2023
Country/Territory: China
City: Suzhou
Period: 8/12/23 - 11/12/23

Keywords

  • audio-visual
  • joint training
  • pre-trained model
  • speaker diarization
