Pseudo-Siamese Network based Timbre-reserved Black-box Adversarial Attack in Speaker Identification

Qing Wang; Jixun Yao; Ziqian Wang; Pengcheng Guo; Lei Xie

doi:10.21437/Interspeech.2023-1352

School of Computer Science

Northwestern Polytechnical University Xian

Research output: Contribution to journal › Conference article › peer-review

1 Scopus citation

Abstract

In this study, we propose a timbre-reserved adversarial attack approach for speaker identification (SID) that not only exploits the weakness of the SID model but also preserves the timbre of the target speaker in a black-box attack setting. Specifically, we generate timbre-reserved fake audio by adding an adversarial constraint during the training of the voice conversion model. We then leverage a pseudo-Siamese network architecture to learn from the black-box SID model, constraining intrinsic similarity and structural similarity simultaneously. The intrinsic similarity loss learns an intrinsic invariance, while the structural similarity loss ensures that the substitute SID model shares a similar decision boundary with the fixed black-box SID model. The substitute model can then be used as a proxy to generate timbre-reserved fake audio for attacking. Experimental results on the Audio Deepfake Detection (ADD) challenge dataset indicate that our proposed approach achieves attack success rates of up to 60.58% and 55.38% in the white-box and black-box scenarios, respectively, and can deceive both human beings and machines.
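
The abstract does not define either loss term, so the following is only a rough sketch of how a substitute model might be fitted to a fixed black-box model with two weighted terms. Everything here is an assumption made for illustration: the function names, the use of cosine distance as a stand-in for the "intrinsic similarity" term, the use of KL divergence as a stand-in for the "structural similarity" term, and the weights `w_intrinsic` and `w_structural`. It is not the authors' formulation.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cosine_distance(a, b):
    """1 - cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def substitute_loss(blackbox_probs, substitute_logits,
                    w_intrinsic=1.0, w_structural=1.0):
    """Hypothetical combined loss for one utterance.

    blackbox_probs    : speaker posteriors returned by the fixed black-box model
    substitute_logits : raw speaker scores from the trainable substitute model
    """
    substitute_probs = softmax(substitute_logits)
    # Assumed stand-in for the intrinsic similarity term.
    intrinsic = cosine_distance(blackbox_probs, substitute_probs)
    # Assumed stand-in for the structural similarity term.
    structural = kl_divergence(blackbox_probs, substitute_probs)
    return w_intrinsic * intrinsic + w_structural * structural


if __name__ == "__main__":
    blackbox = [0.7, 0.2, 0.1]
    agreeing = [math.log(p) for p in blackbox]   # substitute matches the black box
    disagreeing = [0.0, 0.0, 5.0]                # substitute prefers another speaker
    print(substitute_loss(blackbox, agreeing))
    print(substitute_loss(blackbox, disagreeing))
```

In this sketch the loss is near zero when the substitute reproduces the black-box posteriors and grows as the two disagree, which is the behaviour one would want from any term meant to align the substitute's decision boundary with the black-box model's.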

Original language	English
Pages (from-to)	3994-3998
Number of pages	5
Journal	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume	2023-August
DOIs	https://doi.org/10.21437/Interspeech.2023-1352
State	Published - 2023
Event	24th International Speech Communication Association, Interspeech 2023 - Dublin, Ireland
Duration	20 Aug 2023 → 24 Aug 2023

Keywords

adversarial attack
black-box
speaker identification
timbre-reserved

Cite this

@article{bd704076786043d6ad9e64de9885dfd3,

title = "Pseudo-Siamese Network based Timbre-reserved Black-box Adversarial Attack in Speaker Identification",

abstract = "In this study, we propose a timbre-reserved adversarial attack approach for speaker identification (SID) to not only exploit the weakness of the SID model but also preserve the timbre of the target speaker in a black-box attack setting. Particularly, we generate timbre-reserved fake audio by adding an adversarial constraint during the training of the voice conversion model. Then, we leverage a pseudo-Siamese network architecture to learn from the black-box SID model constraining both intrinsic similarity and structural similarity simultaneously. The intrinsic similarity loss is to learn an intrinsic invariance, while the structural similarity loss is to ensure that the substitute SID model shares a similar decision boundary to the fixed black-box SID model. The substitute model can be used as a proxy to generate timbre-reserved fake audio for attacking. Experimental results on the Audio Deepfake Detection (ADD) challenge dataset indicate that the attack success rate of our proposed approach yields up to 60.58% and 55.38% in the white-box and black-box scenarios, respectively, and can deceive both human beings and machines.",

keywords = "adversarial attack, black-box, speaker identification, timbre-reserved",

author = "Qing Wang and Jixun Yao and Ziqian Wang and Pengcheng Guo and Lei Xie",

note = "Publisher Copyright: {\textcopyright} 2023 International Speech Communication Association. All rights reserved.; 24th International Speech Communication Association, Interspeech 2023 ; Conference date: 20-08-2023 Through 24-08-2023",

year = "2023",

doi = "10.21437/Interspeech.2023-1352",

language = "English",

volume = "2023-August",

pages = "3994--3998",

journal = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",

issn = "2308-457X",

}

Pseudo-Siamese Network based Timbre-reserved Black-box Adversarial Attack in Speaker Identification. / Wang, Qing; Yao, Jixun; Wang, Ziqian et al.
In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Vol. 2023-August, 2023, p. 3994-3998.

TY - JOUR

T1 - Pseudo-Siamese Network based Timbre-reserved Black-box Adversarial Attack in Speaker Identification

AU - Wang, Qing

AU - Yao, Jixun

AU - Wang, Ziqian

AU - Guo, Pengcheng

AU - Xie, Lei

PY - 2023

Y1 - 2023

N2 - In this study, we propose a timbre-reserved adversarial attack approach for speaker identification (SID) to not only exploit the weakness of the SID model but also preserve the timbre of the target speaker in a black-box attack setting. Particularly, we generate timbre-reserved fake audio by adding an adversarial constraint during the training of the voice conversion model. Then, we leverage a pseudo-Siamese network architecture to learn from the black-box SID model constraining both intrinsic similarity and structural similarity simultaneously. The intrinsic similarity loss is to learn an intrinsic invariance, while the structural similarity loss is to ensure that the substitute SID model shares a similar decision boundary to the fixed black-box SID model. The substitute model can be used as a proxy to generate timbre-reserved fake audio for attacking. Experimental results on the Audio Deepfake Detection (ADD) challenge dataset indicate that the attack success rate of our proposed approach yields up to 60.58% and 55.38% in the white-box and black-box scenarios, respectively, and can deceive both human beings and machines.

KW - adversarial attack

KW - black-box

KW - speaker identification

KW - timbre-reserved

UR - http://www.scopus.com/inward/record.url?scp=85171532318&partnerID=8YFLogxK

U2 - 10.21437/Interspeech.2023-1352

DO - 10.21437/Interspeech.2023-1352

M3 - Conference article

AN - SCOPUS:85171532318

SN - 2308-457X

VL - 2023-August

SP - 3994

EP - 3998

JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

T2 - 24th International Speech Communication Association, Interspeech 2023

Y2 - 20 August 2023 through 24 August 2023

ER -
