Whisper-SV: Adapting Whisper for low-data-resource speaker verification

Li Zhang; Ning Jiang; Qing Wang; Yue Li; Quan Lu; Lei Xie

doi:10.1016/j.specom.2024.103103

School of Computer Science

Research output: Contribution to journal › Article › peer-review

4 Scopus citations

Abstract

Trained on 680,000 h of massive speech data, Whisper is a multitasking, multilingual speech foundation model demonstrating superior performance in automatic speech recognition, translation, and language identification. However, its applicability in speaker verification (SV) tasks remains unexplored, particularly in low-data-resource scenarios where labeled speaker data in specific domains are limited. To fill this gap, we propose a lightweight adaptor framework to boost SV with Whisper, namely Whisper-SV. Given that Whisper is not specifically optimized for SV tasks, we introduce a representation selection module to quantify the speaker-specific characteristics contained in each layer of Whisper and select the top-k layers with prominent discriminative speaker features. To aggregate pivotal speaker-related features while diminishing non-speaker redundancies across the selected top-k distinct layers of Whisper, we design a multi-layer aggregation module in Whisper-SV to integrate multi-layer representations into a singular, compacted representation for SV. In the multi-layer aggregation module, we employ convolutional layers with shortcut connections among different layers to refine speaker characteristics derived from multi-layer representations from Whisper. In addition, an attention aggregation layer is used to reduce non-speaker interference and amplify speaker-specific cues for SV tasks. Finally, a simple classification module is used for speaker classification. Experiments on VoxCeleb1, FFSVC, and IMSV datasets demonstrate that Whisper-SV achieves EER/minDCF of 2.22%/0.307, 6.14%/0.488, and 7.50%/0.582, respectively, showing superior performance in low-data-resource SV scenarios.
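As a rough illustration of two ideas mentioned in the abstract, the sketch below picks the top-k layers from a set of per-layer scores and computes an equal error rate (EER) from verification trial scores. It is a minimal sketch, not the authors' implementation: the abstract does not say how each layer's speaker-specific content is scored, so the `layer_scores` values, the function names, and the trial scores are all hypothetical, and the EER routine is a simple threshold sweep rather than the evaluation tooling used in the paper.

```python
from typing import Dict, List, Sequence


def select_top_k_layers(layer_scores: Dict[int, float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring layers, in ascending layer order.

    `layer_scores` maps a layer index to a hypothetical speaker-discriminability
    score (higher is better); how such a score is obtained is an assumption here.
    """
    ranked = sorted(layer_scores, key=lambda idx: layer_scores[idx], reverse=True)
    return sorted(ranked[:k])


def equal_error_rate(target_scores: Sequence[float],
                     nontarget_scores: Sequence[float]) -> float:
    """Approximate the EER by sweeping a threshold over every observed score."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection: a same-speaker trial scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: a different-speaker trial scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the operating point where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


if __name__ == "__main__":
    # Hypothetical scores for four layers; not values from the paper.
    scores = {0: 0.1, 1: 0.5, 2: 0.3, 3: 0.9}
    print(select_top_k_layers(scores, k=2))

    # Hypothetical similarity scores for same-speaker and different-speaker trials.
    print(equal_error_rate([0.9, 0.8, 0.4], [0.1, 0.3, 0.5]))
```

The threshold sweep is quadratic in the number of trials, which is fine for a toy example but would be replaced by a sorted, single-pass computation on real trial lists.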

Original language	English
Article number	103103
Journal	Speech Communication
Volume	163
DOIs	https://doi.org/10.1016/j.specom.2024.103103
State	Published - Sep 2024

Keywords

Adaptor
Low-data-resource
Speaker verification
Whisper

Cite this

@article{540d8a093cbf4796bb77652f10f1d5f7,

title = "Whisper-SV: Adapting Whisper for low-data-resource speaker verification",

abstract = "Trained on 680,000 h of massive speech data, Whisper is a multitasking, multilingual speech foundation model demonstrating superior performance in automatic speech recognition, translation, and language identification. However, its applicability in speaker verification (SV) tasks remains unexplored, particularly in low-data-resource scenarios where labeled speaker data in specific domains are limited. To fill this gap, we propose a lightweight adaptor framework to boost SV with Whisper, namely Whisper-SV. Given that Whisper is not specifically optimized for SV tasks, we introduce a representation selection module to quantify the speaker-specific characteristics contained in each layer of Whisper and select the top-k layers with prominent discriminative speaker features. To aggregate pivotal speaker-related features while diminishing non-speaker redundancies across the selected top-k distinct layers of Whisper, we design a multi-layer aggregation module in Whisper-SV to integrate multi-layer representations into a singular, compacted representation for SV. In the multi-layer aggregation module, we employ convolutional layers with shortcut connections among different layers to refine speaker characteristics derived from multi-layer representations from Whisper. In addition, an attention aggregation layer is used to reduce non-speaker interference and amplify speaker-specific cues for SV tasks. Finally, a simple classification module is used for speaker classification. Experiments on VoxCeleb1, FFSVC, and IMSV datasets demonstrate that Whisper-SV achieves EER/minDCF of 2.22%/0.307, 6.14%/0.488, and 7.50%/0.582, respectively, showing superior performance in low-data-resource SV scenarios.",

keywords = "Adaptor, Low-data-resource, Speaker verification, Whisper",

author = "Li Zhang and Ning Jiang and Qing Wang and Yue Li and Quan Lu and Lei Xie",

note = "Publisher Copyright: {\textcopyright} 2024 Elsevier B.V.",

year = "2024",

month = sep,

doi = "10.1016/j.specom.2024.103103",

language = "English",

volume = "163",

journal = "Speech Communication",

issn = "0167-6393",

publisher = "Elsevier B.V.",

}

TY - JOUR

T1 - Whisper-SV

T2 - Adapting Whisper for low-data-resource speaker verification

AU - Zhang, Li

AU - Jiang, Ning

AU - Wang, Qing

AU - Li, Yue

AU - Lu, Quan

AU - Xie, Lei

PY - 2024/9

Y1 - 2024/9

N2 - Trained on 680,000 h of massive speech data, Whisper is a multitasking, multilingual speech foundation model demonstrating superior performance in automatic speech recognition, translation, and language identification. However, its applicability in speaker verification (SV) tasks remains unexplored, particularly in low-data-resource scenarios where labeled speaker data in specific domains are limited. To fill this gap, we propose a lightweight adaptor framework to boost SV with Whisper, namely Whisper-SV. Given that Whisper is not specifically optimized for SV tasks, we introduce a representation selection module to quantify the speaker-specific characteristics contained in each layer of Whisper and select the top-k layers with prominent discriminative speaker features. To aggregate pivotal speaker-related features while diminishing non-speaker redundancies across the selected top-k distinct layers of Whisper, we design a multi-layer aggregation module in Whisper-SV to integrate multi-layer representations into a singular, compacted representation for SV. In the multi-layer aggregation module, we employ convolutional layers with shortcut connections among different layers to refine speaker characteristics derived from multi-layer representations from Whisper. In addition, an attention aggregation layer is used to reduce non-speaker interference and amplify speaker-specific cues for SV tasks. Finally, a simple classification module is used for speaker classification. Experiments on VoxCeleb1, FFSVC, and IMSV datasets demonstrate that Whisper-SV achieves EER/minDCF of 2.22%/0.307, 6.14%/0.488, and 7.50%/0.582, respectively, showing superior performance in low-data-resource SV scenarios.

KW - Adaptor

KW - Low-data-resource

KW - Speaker verification

KW - Whisper

UR - http://www.scopus.com/inward/record.url?scp=85199020095&partnerID=8YFLogxK

U2 - 10.1016/j.specom.2024.103103

DO - 10.1016/j.specom.2024.103103

M3 - Article

AN - SCOPUS:85199020095

SN - 0167-6393

VL - 163

JO - Speech Communication

JF - Speech Communication

M1 - 103103

ER -
