Joint Training or Not: An Exploration of Pre-trained Speech Models in Audio-Visual Speaker Diarization

Huan Zhao; Li Zhang; Yue Li; Yannan Wang; Hongji Wang; Wei Rao; Qing Wang; Lei Xie

doi:10.1007/978-981-97-0601-3_23

School of Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Peer-reviewed

Abstract

The scarcity of labeled audio-visual datasets constrains the training of strong audio-visual speaker diarization (AVSD) systems. To improve AVSD performance, we leverage pre-trained supervised and self-supervised speech models. Specifically, we adopt supervised models (ResNet and ECAPA-TDNN) and self-supervised models (WavLM and HuBERT) as the speaker and audio embedding extractors in an end-to-end AVSD system. We then explore the effectiveness of different architectures, including Transformer, Conformer, and a cross-attention mechanism, in the audio-visual decoder. To mitigate the performance degradation caused by separate training, we jointly train the audio encoder, speaker encoder, and audio-visual decoder of the AVSD system. Experiments on the MISP dataset demonstrate that the proposed method achieves superior performance; the system obtained third place in the MISP Challenge 2022.
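As a rough illustration of the cross-attention mechanism the abstract names as one decoder option, the sketch below implements plain scaled dot-product cross-attention in standard-library Python. It is not the authors' implementation: the choice of audio frames as queries and visual frames as keys and values, the function names `softmax` and `cross_attention`, and the toy inputs are all assumptions made for this example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention (single head, no learned projections).

    queries: T_q vectors of dimension d (assumed here: audio frames)
    keys:    T_k vectors of dimension d (assumed here: visual frames)
    values:  T_k vectors of any dimension (assumed here: visual frames)
    Returns the T_q fused vectors and the T_q x T_k attention weights.
    """
    scale = math.sqrt(len(queries[0]))
    out_dim = len(values[0])
    fused, weights = [], []
    for q in queries:
        # Similarity of this query frame to every key frame.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Weighted average of the value frames.
        fused.append([sum(w[t] * values[t][j] for t in range(len(values)))
                      for j in range(out_dim)])
    return fused, weights

# Toy inputs, made up for illustration: 3 audio frames, 2 visual frames, d = 2.
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
visual = [[1.0, 0.0], [0.0, 1.0]]
fused, weights = cross_attention(audio, visual, visual)
```

Each row of `weights` sums to 1, so every fused vector is a convex combination of the visual frames. A trained decoder would additionally apply learned query, key, and value projections and typically several heads, none of which are specified in the abstract.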

Original language	English
Title of host publication	Man-Machine Speech Communication - 18th National Conference, NCMMSC 2023, Proceedings
Editors	Jia Jia, Zhenhua Ling, Xie Chen, Ya Li, Zixing Zhang
Publisher	Springer Science and Business Media Deutschland GmbH
Pages	265-275
Number of pages	11
ISBN (Print)	9789819706006
DOI	https://doi.org/10.1007/978-981-97-0601-3_23
Publication status	Published - 2024
Event	18th National Conference on Man-Machine Speech Communication, NCMMSC 2023 - Suzhou, China. Duration: 8 Dec 2023 → 11 Dec 2023

Publication series

Name	Communications in Computer and Information Science
Volume	2006
ISSN (Print)	1865-0929
ISSN (Electronic)	1865-0937

Conference

Conference	18th National Conference on Man-Machine Speech Communication, NCMMSC 2023
Country/Territory	China
City	Suzhou
Period	8 Dec 2023 → 11 Dec 2023

Cite this

Zhao, H., Zhang, L., Li, Y., Wang, Y., Wang, H., Rao, W., Wang, Q., & Xie, L. (2024). Joint Training or Not: An Exploration of Pre-trained Speech Models in Audio-Visual Speaker Diarization. In J. Jia, Z. Ling, X. Chen, Y. Li, & Z. Zhang (Eds.), Man-Machine Speech Communication - 18th National Conference, NCMMSC 2023, Proceedings (pp. 265-275). (Communications in Computer and Information Science; Vol. 2006). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-981-97-0601-3_23

Zhao, Huan ; Zhang, Li ; Li, Yue et al. / Joint Training or Not : An Exploration of Pre-trained Speech Models in Audio-Visual Speaker Diarization. Man-Machine Speech Communication - 18th National Conference, NCMMSC 2023, Proceedings. editor / Jia Jia ; Zhenhua Ling ; Xie Chen ; Ya Li ; Zixing Zhang. Springer Science and Business Media Deutschland GmbH, 2024. pp. 265-275 (Communications in Computer and Information Science).

@inproceedings{e833b90b74a34008993076e5581f9db1,

title = "Joint Training or Not: An Exploration of Pre-trained Speech Models in Audio-Visual Speaker Diarization",

abstract = "The scarcity of labeled audio-visual datasets is a constraint for training superior audio-visual speaker diarization systems. To improve the performance of audio-visual speaker diarization, we leverage pre-trained supervised and self-supervised speech models for audio-visual speaker diarization. Specifically, we adopt supervised (ResNet and ECAPA-TDNN) and self-supervised pre-trained models (WavLM and HuBERT) as the speaker and audio embedding extractors in an end-to-end audio-visual speaker diarization (AVSD) system. Then we explore the effectiveness of different frameworks, including Transformer, Conformer, and cross-attention mechanism, in the audio-visual decoder. To mitigate the degradation of performance caused by separate training, we jointly train the audio encoder, speaker encoder, and audio-visual decoder in the AVSD system. Experiments on the MISP dataset demonstrate that the proposed method achieves superior performance and obtained third place in MISP Challenge 2022.",

keywords = "audio-visual, joint training, pre-trained model, speaker diarization",

author = "Huan Zhao and Li Zhang and Yue Li and Yannan Wang and Hongji Wang and Wei Rao and Qing Wang and Lei Xie",

note = "Publisher Copyright: {\textcopyright} The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024.; 18th National Conference on Man-Machine Speech Communication, NCMMSC 2023 ; Conference date: 08-12-2023 Through 11-12-2023",

year = "2024",

doi = "10.1007/978-981-97-0601-3_23",

language = "English",

isbn = "9789819706006",

series = "Communications in Computer and Information Science",

publisher = "Springer Science and Business Media Deutschland GmbH",

pages = "265--275",

editor = "Jia Jia and Zhenhua Ling and Xie Chen and Ya Li and Zixing Zhang",

booktitle = "Man-Machine Speech Communication - 18th National Conference, NCMMSC 2023, Proceedings",

}

Zhao, H, Zhang, L, Li, Y, Wang, Y, Wang, H, Rao, W, Wang, Q & Xie, L 2024, Joint Training or Not: An Exploration of Pre-trained Speech Models in Audio-Visual Speaker Diarization. in J Jia, Z Ling, X Chen, Y Li & Z Zhang (eds), Man-Machine Speech Communication - 18th National Conference, NCMMSC 2023, Proceedings. Communications in Computer and Information Science, vol. 2006, Springer Science and Business Media Deutschland GmbH, pp. 265-275, 18th National Conference on Man-Machine Speech Communication, NCMMSC 2023, Suzhou, China, 8/12/23. https://doi.org/10.1007/978-981-97-0601-3_23

Joint Training or Not: An Exploration of Pre-trained Speech Models in Audio-Visual Speaker Diarization. / Zhao, Huan; Zhang, Li; Li, Yue et al.
Man-Machine Speech Communication - 18th National Conference, NCMMSC 2023, Proceedings. ed. / Jia Jia; Zhenhua Ling; Xie Chen; Ya Li; Zixing Zhang. Springer Science and Business Media Deutschland GmbH, 2024. pp. 265-275 (Communications in Computer and Information Science; Vol. 2006).

TY - GEN

T1 - Joint Training or Not

T2 - 18th National Conference on Man-Machine Speech Communication, NCMMSC 2023

AU - Zhao, Huan

AU - Zhang, Li

AU - Li, Yue

AU - Wang, Yannan

AU - Wang, Hongji

AU - Rao, Wei

AU - Wang, Qing

AU - Xie, Lei

N1 - Publisher Copyright: © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024.

PY - 2024

Y1 - 2024

AB - The scarcity of labeled audio-visual datasets is a constraint for training superior audio-visual speaker diarization systems. To improve the performance of audio-visual speaker diarization, we leverage pre-trained supervised and self-supervised speech models for audio-visual speaker diarization. Specifically, we adopt supervised (ResNet and ECAPA-TDNN) and self-supervised pre-trained models (WavLM and HuBERT) as the speaker and audio embedding extractors in an end-to-end audio-visual speaker diarization (AVSD) system. Then we explore the effectiveness of different frameworks, including Transformer, Conformer, and cross-attention mechanism, in the audio-visual decoder. To mitigate the degradation of performance caused by separate training, we jointly train the audio encoder, speaker encoder, and audio-visual decoder in the AVSD system. Experiments on the MISP dataset demonstrate that the proposed method achieves superior performance and obtained third place in MISP Challenge 2022.

KW - audio-visual

KW - joint training

KW - pre-trained model

KW - speaker diarization

UR - http://www.scopus.com/inward/record.url?scp=85186636341&partnerID=8YFLogxK

U2 - 10.1007/978-981-97-0601-3_23

DO - 10.1007/978-981-97-0601-3_23

M3 - Conference contribution

AN - SCOPUS:85186636341

SN - 9789819706006

T3 - Communications in Computer and Information Science

SP - 265

EP - 275

BT - Man-Machine Speech Communication - 18th National Conference, NCMMSC 2023, Proceedings

A2 - Jia, Jia

A2 - Ling, Zhenhua

A2 - Chen, Xie

A2 - Li, Ya

A2 - Zhang, Zixing

PB - Springer Science and Business Media Deutschland GmbH

Y2 - 8 December 2023 through 11 December 2023

ER -

Zhao H, Zhang L, Li Y, Wang Y, Wang H, Rao W et al. Joint Training or Not: An Exploration of Pre-trained Speech Models in Audio-Visual Speaker Diarization. In Jia J, Ling Z, Chen X, Li Y, Zhang Z, editors, Man-Machine Speech Communication - 18th National Conference, NCMMSC 2023, Proceedings. Springer Science and Business Media Deutschland GmbH. 2024. p. 265-275. (Communications in Computer and Information Science). doi: 10.1007/978-981-97-0601-3_23
