AccMyrinx: Speech Synthesis with Non-Acoustic Sensor

Yunji Liang; Yuchen Qin; Qi Li; Xiaokai Yan; Zhiwen Yu; Bin Guo; Sagar Samtani; Yanyong Zhang

doi:10.1145/3550338

School of Computer Science

Research output: Contribution to journal › Article › Peer-reviewed

2 Citations (Scopus)

Abstract

The built-in loudspeakers of mobile devices (e.g., smartphones, smartwatches, and tablets) play a significant role in human-machine interaction, such as playing music, making phone calls, and enabling voice-based interaction. Prior studies have shown that it is feasible to eavesdrop on the loudspeaker via motion sensors, but whether speech can be synthesized from non-acoustic signals sampled at sub-Nyquist frequencies has not been studied. In this paper, we present AccMyrinx, an end-to-end speech synthesis framework that reconstructs the acoustic waveforms playing on the loudspeaker from the vibrations captured by the built-in low-resolution accelerometer of a mobile device. AccMyrinx takes advantage of the fact that the accelerometer and the loudspeaker sit on the same motherboard, and compromises the loudspeaker through the solid-borne vibrations that the accelerometer captures. The low-resolution vibration signals are fed to a wavelet-based MelGAN to generate intelligible acoustic waveforms. We conducted extensive experiments on a large-scale dataset built from audio clips downloaded from Voice of America (VOA). The experimental results show that AccMyrinx can reconstruct intelligible acoustic signals playing on the loudspeaker with a smoothed word error rate (SWER) of 42.67%. The quality of the synthesized speech can be severely affected by several factors, including gender, speech rate, and volume.
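The "sub-Nyquist" condition mentioned in the abstract can be illustrated numerically: a tone above half the sampling rate is not simply lost but folds back (aliases) to a lower apparent frequency. The following is a minimal sketch of that general signal-processing fact, not the paper's method. The 500 Hz accelerometer rate and the tone frequencies are assumed values chosen only for illustration, and `alias_frequency` is a hypothetical helper.

```python
def alias_frequency(f_signal: float, f_sample: float) -> float:
    """Return the apparent frequency (Hz) of a pure tone after sampling.

    A tone at f_signal sampled at f_sample is indistinguishable from a tone
    at the returned frequency, which always lies in [0, f_sample / 2].
    """
    if f_sample <= 0:
        raise ValueError("sampling rate must be positive")
    f_folded = f_signal % f_sample              # fold into [0, f_sample)
    return min(f_folded, f_sample - f_folded)   # reflect into [0, f_sample / 2]


# Assumed values for illustration only (not taken from the paper).
ACCEL_RATE_HZ = 500.0
for tone_hz in (100.0, 300.0, 1000.0, 1150.0):
    apparent_hz = alias_frequency(tone_hz, ACCEL_RATE_HZ)
    print(f"{tone_hz:7.1f} Hz tone -> appears at {apparent_hz:6.1f} Hz")
```

Under these assumed numbers, only the 100 Hz tone keeps its true frequency; the others land elsewhere in the 0 to 250 Hz band, which is why recovering speech from such a sensor needs a learned reconstruction rather than plain resampling.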

Original language	English
Article number	127
Journal	Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies
Volume	6
Issue number	3
DOI	https://doi.org/10.1145/3550338
Publication status	Published - 7 Sep 2022

Cite this

@article{a0241dc8079d402ea0f1cc6f3c056627,

title = "AccMyrinx: Speech Synthesis with Non-Acoustic Sensor",

keywords = "accelerometer, generative adversary network, non-acoustic sensor, speaker, speech synthesis",

author = "Yunji Liang and Yuchen Qin and Qi Li and Xiaokai Yan and Zhiwen Yu and Bin Guo and Sagar Samtani and Yanyong Zhang",

note = "Publisher Copyright: {\textcopyright} 2022 ACM.",

year = "2022",

month = sep,

day = "7",

doi = "10.1145/3550338",

language = "English",

volume = "6",

journal = "Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies",

issn = "2474-9567",

publisher = "Association for Computing Machinery (ACM)",

number = "3",

}

TY - JOUR

T1 - AccMyrinx

T2 - Speech Synthesis with Non-Acoustic Sensor

AU - Liang, Yunji

AU - Qin, Yuchen

AU - Li, Qi

AU - Yan, Xiaokai

AU - Yu, Zhiwen

AU - Guo, Bin

AU - Samtani, Sagar

AU - Zhang, Yanyong

PY - 2022/9/7

Y1 - 2022/9/7

KW - accelerometer

KW - generative adversary network

KW - non-acoustic sensor

KW - speaker

KW - speech synthesis

UR - http://www.scopus.com/inward/record.url?scp=85139249714&partnerID=8YFLogxK

U2 - 10.1145/3550338

DO - 10.1145/3550338

M3 - Article

AN - SCOPUS:85139249714

SN - 2474-9567

VL - 6

JO - Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies

JF - Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies

IS - 3

M1 - 127

ER -
