TY - JOUR
T1 - Timbre-Reserved Adversarial Attack in Speaker Identification
AU - Wang, Qing
AU - Yao, Jixun
AU - Zhang, Li
AU - Guo, Pengcheng
AU - Xie, Lei
N1 - Publisher Copyright: © 2014 IEEE.
PY - 2023
Y1 - 2023
N2 - As a type of biometric identification, speaker identification (SID) systems face various attacks. Spoofing attacks imitate the target speaker's timbre, while adversarial attacks confuse SID systems with well-designed perturbations. Spoofing mimics the victim's timbre but does not exploit SID model vulnerabilities, so it may not achieve the attacker's goal. Adversarial attacks, on the other hand, can lead the SID system to a designated decision but may not meet the specific text or speaker timbre requirements of certain attack scenarios. In this study, we propose a timbre-reserved adversarial attack in speaker identification that leverages SID model vulnerabilities while preserving the target speaker's timbre. We generate timbre-reserved adversarial audio by adding an adversarial constraint during different training stages of the voice conversion (VC) model. This constraint uses the target speaker label to optimize adversarial perturbations in the VC model representations and is implemented through a speaker classifier integrated into VC model training. The adversarial constraint helps control the VC model to generate speaker-wised audio. Ultimately, the VC model's inference produces ideal timbre-reserved adversarial audio capable of deceiving the SID system. Experimental results on the Audio Deepfake Detection (ADD) challenge dataset demonstrate that our method significantly improves the attack success rate compared to the vanilla VC model, without introducing additional adversarial noise to the attack speech. Objective and subjective evaluations confirm the superior quality of the fake audio generated by our approach compared to directly adding adversarial perturbation to VC-generated audio. Additionally, our analysis indicates that the generated adversarial fake audio meets the attacker's specified text and target speaker timbre requirements.
AB - As a type of biometric identification, speaker identification (SID) systems face various attacks. Spoofing attacks imitate the target speaker's timbre, while adversarial attacks confuse SID systems with well-designed perturbations. Spoofing mimics the victim's timbre but does not exploit SID model vulnerabilities, so it may not achieve the attacker's goal. Adversarial attacks, on the other hand, can lead the SID system to a designated decision but may not meet the specific text or speaker timbre requirements of certain attack scenarios. In this study, we propose a timbre-reserved adversarial attack in speaker identification that leverages SID model vulnerabilities while preserving the target speaker's timbre. We generate timbre-reserved adversarial audio by adding an adversarial constraint during different training stages of the voice conversion (VC) model. This constraint uses the target speaker label to optimize adversarial perturbations in the VC model representations and is implemented through a speaker classifier integrated into VC model training. The adversarial constraint helps control the VC model to generate speaker-wised audio. Ultimately, the VC model's inference produces ideal timbre-reserved adversarial audio capable of deceiving the SID system. Experimental results on the Audio Deepfake Detection (ADD) challenge dataset demonstrate that our method significantly improves the attack success rate compared to the vanilla VC model, without introducing additional adversarial noise to the attack speech. Objective and subjective evaluations confirm the superior quality of the fake audio generated by our approach compared to directly adding adversarial perturbation to VC-generated audio. Additionally, our analysis indicates that the generated adversarial fake audio meets the attacker's specified text and target speaker timbre requirements.
KW - Adversarial attack
KW - speaker identification
KW - timbre-reserved
KW - voice conversion
UR - http://www.scopus.com/inward/record.url?scp=85168735318&partnerID=8YFLogxK
U2 - 10.1109/TASLP.2023.3306714
DO - 10.1109/TASLP.2023.3306714
M3 - Article
AN - SCOPUS:85168735318
SN - 2329-9290
VL - 31
SP - 3848
EP - 3858
JO - IEEE/ACM Transactions on Audio Speech and Language Processing
JF - IEEE/ACM Transactions on Audio Speech and Language Processing
ER -