TY - JOUR
T1 - SCDNet
T2 - 25th Interspeech Conference 2024
AU - Li, Yue
AU - Wang, Xinsheng
AU - Zhang, Li
AU - Xie, Lei
N1 - Publisher Copyright:
© 2024 International Speech Communication Association. All rights reserved.
PY - 2024
Y1 - 2024
AB - Speaker Change Detection (SCD) aims to identify boundaries between speakers in a conversation. Motivated by the success of fine-tuning wav2vec 2.0 models for the SCD task, a further investigation of self-supervised learning (SSL) features for SCD is conducted in this work. Specifically, an SCD model, named SCDNet, is proposed. With this model, various state-of-the-art SSL models, including HuBERT, wav2vec 2.0, and WavLM, are investigated. To discern the most potent layer of SSL models for SCD, a learnable weighting method is employed to analyze the effectiveness of intermediate representations. Additionally, a fine-tuning-based approach is implemented to further compare the characteristics of SSL models in the SCD task. Furthermore, a contrastive learning method is proposed to mitigate the overfitting tendencies in the training of both the fine-tuning-based method and SCDNet. Experiments showcase the superiority of WavLM in the SCD task and also demonstrate the effectiveness of SCDNet's design.
KW - contrastive learning
KW - self-supervised models
KW - speaker change detection
UR - http://www.scopus.com/inward/record.url?scp=85214827215&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2024-752
DO - 10.21437/Interspeech.2024-752
M3 - Conference article
AN - SCOPUS:85214827215
SN - 2308-457X
SP - 4718
EP - 4722
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Y2 - 1 September 2024 through 5 September 2024
ER -