Timbre-Reserved Adversarial Attack in Speaker Identification

Qing Wang; Jixun Yao; Li Zhang; Pengcheng Guo; Lei Xie

doi:10.1109/TASLP.2023.3306714

School of Computer Science

Northwestern Polytechnical University, Xi'an

Research output: Contribution to journal › Article › peer-review

4 Scopus citations

Abstract

As a type of biometric identification, speaker identification (SID) systems face various attacks. Spoofing attacks imitate target speakers' timbre, while adversarial attacks confuse SID systems with well-designed perturbations. Spoofing mimics victim timbre but fails to exploit SID model vulnerabilities, potentially not achieving the attacker's goal. On the other hand, adversarial attacks can lead SID to a decision but may not meet specific text or speaker timbre requirements for certain attack scenarios. In this study, we propose a timbre-reserved adversarial attack in speaker identification to leverage SID model vulnerabilities while preserving the target speaker's timbre. We generate timbre-reserved adversarial audio by adding an adversarial constraint during different training stages of the voice conversion (VC) model. This constraint utilizes the target speaker label to optimize adversarial perturbations in VC model representations and is implemented through a speaker classifier integrated into VC model training. This adversarial constraint helps control the VC model to generate speaker-wised audio. Ultimately, the VC model's inference produces ideal timbre-reserved adversarial audio capable of deceiving the SID system. Experimental results on the Audio Deepfake Detection (ADD) challenge dataset demonstrate that our method significantly improves the attack success rate compared to the vanilla VC model, without introducing additional adversarial noise to the attack speech. Objective and subjective evaluations confirm the superior quality of fake audio generated by our approach compared to directly adding adversarial perturbation to VC-generated audio. Additionally, our analysis indicates that our generated adversarial fake audio meets the specified text and target speaker timbre requirements of the attacker.
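The adversarial constraint described in the abstract can be pictured as an extra speaker-classification term added to the VC training objective. The following is a minimal Python sketch under stated assumptions, not the authors' implementation: the function names, the `weight` parameter, the choice of cross-entropy, and the additive combination are all illustrative assumptions. The abstract states only that a speaker classifier using the target speaker label is integrated into VC model training.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the target speaker."""
    return -math.log(softmax(logits)[target_index])


def combined_loss(vc_loss, classifier_logits, target_speaker, weight=1.0):
    """Hypothetical training objective: the usual VC loss plus a weighted
    speaker-classification penalty (the assumed 'adversarial constraint').
    The penalty shrinks as the classifier scores the converted audio
    more strongly as the target speaker."""
    return vc_loss + weight * cross_entropy(classifier_logits, target_speaker)


# Classifier scores for three speakers; speaker 0 is the attack target.
on_target = combined_loss(0.5, [5.0, 0.0, 0.0], target_speaker=0)
off_target = combined_loss(0.5, [0.0, 5.0, 0.0], target_speaker=0)
print(on_target < off_target)
```

In this sketch, minimizing the combined value pushes the converted audio toward being classified as the target speaker while the VC term is still optimized, which is one plausible reading of how a classifier-based constraint could raise the attack success rate without adding a separate noise signal.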

Original language	English
Pages (from-to)	3848-3858
Number of pages	11
Journal	IEEE/ACM Transactions on Audio, Speech, and Language Processing
Volume	31
DOIs	https://doi.org/10.1109/TASLP.2023.3306714
State	Published - 2023

Keywords

Adversarial attack
speaker identification
timbre-reserved
voice conversion

Access to Document

10.1109/TASLP.2023.3306714

Cite this

@article{3babecf00bc2469fb7a6984ab14cd67b,

title = "Timbre-Reserved Adversarial Attack in Speaker Identification",

abstract = "As a type of biometric identification, speaker identification (SID) systems face various attacks. Spoofing attacks imitate target speakers' timbre, while adversarial attacks confuse SID systems with well-designed perturbations. Spoofing mimics victim timbre but fails to exploit SID model vulnerabilities, potentially not achieving the attacker's goal. On the other hand, adversarial attacks can lead SID to a decision but may not meet specific text or speaker timbre requirements for certain attack scenarios. In this study, we propose a timbre-reserved adversarial attack in speaker identification to leverage SID model vulnerabilities while preserving the target speaker's timbre. We generate timbre-reserved adversarial audio by adding an adversarial constraint during different training stages of the voice conversion (VC) model. This constraint utilizes the target speaker label to optimize adversarial perturbations in VC model representations and is implemented through a speaker classifier integrated into VC model training. This adversarial constraint helps control the VC model to generate speaker-wised audio. Ultimately, the VC model's inference produces ideal timbre-reserved adversarial audio capable of deceiving the SID system. Experimental results on the Audio Deepfake Detection (ADD) challenge dataset demonstrate that our method significantly improves the attack success rate compared to the vanilla VC model, without introducing additional adversarial noise to the attack speech. Objective and subjective evaluations confirm the superior quality of fake audio generated by our approach compared to directly adding adversarial perturbation to VC-generated audio. Additionally, our analysis indicates that our generated adversarial fake audio meets the specified text and target speaker timbre requirements of the attacker.",

keywords = "Adversarial attack, speaker identification, timbre-reserved, voice conversion",

author = "Qing Wang and Jixun Yao and Li Zhang and Pengcheng Guo and Lei Xie",

note = "Publisher Copyright: {\textcopyright} 2014 IEEE.",

year = "2023",

doi = "10.1109/TASLP.2023.3306714",

language = "English",

volume = "31",

pages = "3848--3858",

journal = "IEEE/ACM Transactions on Audio, Speech, and Language Processing",

issn = "2329-9290",

publisher = "IEEE Advancing Technology for Humanity",

}

TY - JOUR

T1 - Timbre-Reserved Adversarial Attack in Speaker Identification

AU - Wang, Qing

AU - Yao, Jixun

AU - Zhang, Li

AU - Guo, Pengcheng

AU - Xie, Lei

PY - 2023

Y1 - 2023

N2 - As a type of biometric identification, speaker identification (SID) systems face various attacks. Spoofing attacks imitate target speakers' timbre, while adversarial attacks confuse SID systems with well-designed perturbations. Spoofing mimics victim timbre but fails to exploit SID model vulnerabilities, potentially not achieving the attacker's goal. On the other hand, adversarial attacks can lead SID to a decision but may not meet specific text or speaker timbre requirements for certain attack scenarios. In this study, we propose a timbre-reserved adversarial attack in speaker identification to leverage SID model vulnerabilities while preserving the target speaker's timbre. We generate timbre-reserved adversarial audio by adding an adversarial constraint during different training stages of the voice conversion (VC) model. This constraint utilizes the target speaker label to optimize adversarial perturbations in VC model representations and is implemented through a speaker classifier integrated into VC model training. This adversarial constraint helps control the VC model to generate speaker-wised audio. Ultimately, the VC model's inference produces ideal timbre-reserved adversarial audio capable of deceiving the SID system. Experimental results on the Audio Deepfake Detection (ADD) challenge dataset demonstrate that our method significantly improves the attack success rate compared to the vanilla VC model, without introducing additional adversarial noise to the attack speech. Objective and subjective evaluations confirm the superior quality of fake audio generated by our approach compared to directly adding adversarial perturbation to VC-generated audio. Additionally, our analysis indicates that our generated adversarial fake audio meets the specified text and target speaker timbre requirements of the attacker.

AB - As a type of biometric identification, speaker identification (SID) systems face various attacks. Spoofing attacks imitate target speakers' timbre, while adversarial attacks confuse SID systems with well-designed perturbations. Spoofing mimics victim timbre but fails to exploit SID model vulnerabilities, potentially not achieving the attacker's goal. On the other hand, adversarial attacks can lead SID to a decision but may not meet specific text or speaker timbre requirements for certain attack scenarios. In this study, we propose a timbre-reserved adversarial attack in speaker identification to leverage SID model vulnerabilities while preserving the target speaker's timbre. We generate timbre-reserved adversarial audio by adding an adversarial constraint during different training stages of the voice conversion (VC) model. This constraint utilizes the target speaker label to optimize adversarial perturbations in VC model representations and is implemented through a speaker classifier integrated into VC model training. This adversarial constraint helps control the VC model to generate speaker-wised audio. Ultimately, the VC model's inference produces ideal timbre-reserved adversarial audio capable of deceiving the SID system. Experimental results on the Audio Deepfake Detection (ADD) challenge dataset demonstrate that our method significantly improves the attack success rate compared to the vanilla VC model, without introducing additional adversarial noise to the attack speech. Objective and subjective evaluations confirm the superior quality of fake audio generated by our approach compared to directly adding adversarial perturbation to VC-generated audio. Additionally, our analysis indicates that our generated adversarial fake audio meets the specified text and target speaker timbre requirements of the attacker.

KW - Adversarial attack

KW - speaker identification

KW - timbre-reserved

KW - voice conversion

UR - http://www.scopus.com/inward/record.url?scp=85168735318&partnerID=8YFLogxK

U2 - 10.1109/TASLP.2023.3306714

DO - 10.1109/TASLP.2023.3306714

M3 - Article

AN - SCOPUS:85168735318

SN - 2329-9290

VL - 31

SP - 3848

EP - 3858

JO - IEEE/ACM Transactions on Audio, Speech, and Language Processing

JF - IEEE/ACM Transactions on Audio, Speech, and Language Processing

ER -
