Dualsep: A Light-Weight Dual-Encoder Convolutional Recurrent Network For Real-Time In-Car Speech Separation

Ziqian Wang; Jiayao Sun; Zihan Zhang; Xingchen Li; Jie Liu; Lei Xie

doi:10.1109/SLT61566.2024.10832223

School of Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Scopus citation

Abstract

Advancements in deep learning and voice-activated technologies have driven the development of human-vehicle interaction. Distributed microphone arrays are widely used in in-car scenarios because they can accurately capture the voices of passengers from different speech zones. However, the increase in the number of audio channels, coupled with the limited computational resources and low latency requirements of in-car systems, presents challenges for in-car multi-channel speech separation. To mitigate these problems, we propose a lightweight framework that cascades digital signal processing (DSP) and neural networks (NN). We utilize fixed beamforming (BF) to reduce computational costs and independent vector analysis (IVA) to provide a spatial prior. We employ dual encoders for dual-branch modeling, with a spatial encoder capturing spatial cues and a spectral encoder preserving spectral information, facilitating spatial-spectral fusion. Our proposed system supports both streaming and non-streaming modes. Experimental results demonstrate the superiority of the proposed system across various metrics. With only 0.83 M parameters and a real-time factor (RTF) of 0.39 on an Intel Core i7 (2.6 GHz) CPU, it effectively separates speech into distinct speech zones.
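
The real-time factor quoted in the abstract is conventionally the ratio of processing time to audio duration, so a value below 1 means the system runs faster than real time. The sketch below illustrates that arithmetic under this conventional definition, which the abstract itself does not spell out. The function name and the 10-second clip duration are illustrative assumptions, not taken from the paper.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Conventional RTF: time spent processing divided by audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Hypothetical example: at the reported RTF of 0.39, a 10-second clip
# would need roughly 3.9 seconds of processing on the stated CPU.
reported_rtf = 0.39
clip_seconds = 10.0  # illustrative duration, not from the paper
processing_seconds = reported_rtf * clip_seconds

# An RTF below 1.0 means the audio is processed faster than it is played.
is_real_time = real_time_factor(processing_seconds, clip_seconds) < 1.0
```

Under this reading, an RTF of 0.39 leaves about 61% of each audio frame's duration as headroom on the reported hardware.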

Original language	English
Title of host publication	Proceedings of 2024 IEEE Spoken Language Technology Workshop, SLT 2024
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	286-293
Number of pages	8
ISBN (Electronic)	9798350392258
DOIs	https://doi.org/10.1109/SLT61566.2024.10832223
State	Published - 2024
Event	2024 IEEE Spoken Language Technology Workshop, SLT 2024 - Macao, China Duration: 2 Dec 2024 → 5 Dec 2024

Publication series

Name	Proceedings of 2024 IEEE Spoken Language Technology Workshop, SLT 2024

Conference

Conference	2024 IEEE Spoken Language Technology Workshop, SLT 2024
Country/Territory	China
City	Macao
Period	2/12/24 → 5/12/24

Keywords

deep learning
in-car communication
microphone arrays
speech separation

Access to Document

10.1109/SLT61566.2024.10832223

Cite this

Wang, Z., Sun, J., Zhang, Z., Li, X., Liu, J., & Xie, L. (2024). Dualsep: A Light-Weight Dual-Encoder Convolutional Recurrent Network For Real-Time In-Car Speech Separation. In Proceedings of 2024 IEEE Spoken Language Technology Workshop, SLT 2024 (pp. 286-293). (Proceedings of 2024 IEEE Spoken Language Technology Workshop, SLT 2024). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/SLT61566.2024.10832223

Wang, Ziqian ; Sun, Jiayao ; Zhang, Zihan et al. / Dualsep : A Light-Weight Dual-Encoder Convolutional Recurrent Network For Real-Time In-Car Speech Separation. Proceedings of 2024 IEEE Spoken Language Technology Workshop, SLT 2024. Institute of Electrical and Electronics Engineers Inc., 2024. pp. 286-293 (Proceedings of 2024 IEEE Spoken Language Technology Workshop, SLT 2024).

@inproceedings{cf63882df5eb4daaba7a386ed70d6cf7,

title = "Dualsep: A Light-Weight Dual-Encoder Convolutional Recurrent Network For Real-Time In-Car Speech Separation",

abstract = "Advancements in deep learning and voice-activated technologies have driven the development of human-vehicle interaction. Distributed microphone arrays are widely used in in-car scenarios because they can accurately capture the voices of passengers from different speech zones. However, the increase in the number of audio channels, coupled with the limited computational resources and low latency requirements of in-car systems, presents challenges for in-car multi-channel speech separation. To mitigate these problems, we propose a lightweight framework that cascades digital signal processing (DSP) and neural networks (NN). We utilize fixed beamforming (BF) to reduce computational costs and independent vector analysis (IVA) to provide a spatial prior. We employ dual encoders for dual-branch modeling, with a spatial encoder capturing spatial cues and a spectral encoder preserving spectral information, facilitating spatial-spectral fusion. Our proposed system supports both streaming and non-streaming modes. Experimental results demonstrate the superiority of the proposed system across various metrics. With only 0.83 M parameters and a real-time factor (RTF) of 0.39 on an Intel Core i7 (2.6 GHz) CPU, it effectively separates speech into distinct speech zones.",

keywords = "deep learning, in-car communication, microphone arrays, speech separation",

author = "Ziqian Wang and Jiayao Sun and Zihan Zhang and Xingchen Li and Jie Liu and Lei Xie",

note = "Publisher Copyright: {\textcopyright} 2024 IEEE.; 2024 IEEE Spoken Language Technology Workshop, SLT 2024 ; Conference date: 02-12-2024 Through 05-12-2024",

year = "2024",

doi = "10.1109/SLT61566.2024.10832223",

language = "English",

series = "Proceedings of 2024 IEEE Spoken Language Technology Workshop, SLT 2024",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

pages = "286--293",

booktitle = "Proceedings of 2024 IEEE Spoken Language Technology Workshop, SLT 2024",

}

Wang, Z, Sun, J, Zhang, Z, Li, X, Liu, J & Xie, L 2024, Dualsep: A Light-Weight Dual-Encoder Convolutional Recurrent Network For Real-Time In-Car Speech Separation. in Proceedings of 2024 IEEE Spoken Language Technology Workshop, SLT 2024. Proceedings of 2024 IEEE Spoken Language Technology Workshop, SLT 2024, Institute of Electrical and Electronics Engineers Inc., pp. 286-293, 2024 IEEE Spoken Language Technology Workshop, SLT 2024, Macao, China, 2/12/24. https://doi.org/10.1109/SLT61566.2024.10832223

Dualsep: A Light-Weight Dual-Encoder Convolutional Recurrent Network For Real-Time In-Car Speech Separation. / Wang, Ziqian; Sun, Jiayao; Zhang, Zihan et al.
Proceedings of 2024 IEEE Spoken Language Technology Workshop, SLT 2024. Institute of Electrical and Electronics Engineers Inc., 2024. p. 286-293 (Proceedings of 2024 IEEE Spoken Language Technology Workshop, SLT 2024).

TY - GEN

T1 - Dualsep

T2 - 2024 IEEE Spoken Language Technology Workshop, SLT 2024

AU - Wang, Ziqian

AU - Sun, Jiayao

AU - Zhang, Zihan

AU - Li, Xingchen

AU - Liu, Jie

AU - Xie, Lei

PY - 2024

Y1 - 2024

N2 - Advancements in deep learning and voice-activated technologies have driven the development of human-vehicle interaction. Distributed microphone arrays are widely used in in-car scenarios because they can accurately capture the voices of passengers from different speech zones. However, the increase in the number of audio channels, coupled with the limited computational resources and low latency requirements of in-car systems, presents challenges for in-car multi-channel speech separation. To mitigate these problems, we propose a lightweight framework that cascades digital signal processing (DSP) and neural networks (NN). We utilize fixed beamforming (BF) to reduce computational costs and independent vector analysis (IVA) to provide a spatial prior. We employ dual encoders for dual-branch modeling, with a spatial encoder capturing spatial cues and a spectral encoder preserving spectral information, facilitating spatial-spectral fusion. Our proposed system supports both streaming and non-streaming modes. Experimental results demonstrate the superiority of the proposed system across various metrics. With only 0.83 M parameters and a real-time factor (RTF) of 0.39 on an Intel Core i7 (2.6 GHz) CPU, it effectively separates speech into distinct speech zones.

AB - Advancements in deep learning and voice-activated technologies have driven the development of human-vehicle interaction. Distributed microphone arrays are widely used in in-car scenarios because they can accurately capture the voices of passengers from different speech zones. However, the increase in the number of audio channels, coupled with the limited computational resources and low latency requirements of in-car systems, presents challenges for in-car multi-channel speech separation. To mitigate these problems, we propose a lightweight framework that cascades digital signal processing (DSP) and neural networks (NN). We utilize fixed beamforming (BF) to reduce computational costs and independent vector analysis (IVA) to provide a spatial prior. We employ dual encoders for dual-branch modeling, with a spatial encoder capturing spatial cues and a spectral encoder preserving spectral information, facilitating spatial-spectral fusion. Our proposed system supports both streaming and non-streaming modes. Experimental results demonstrate the superiority of the proposed system across various metrics. With only 0.83 M parameters and a real-time factor (RTF) of 0.39 on an Intel Core i7 (2.6 GHz) CPU, it effectively separates speech into distinct speech zones.

KW - deep learning

KW - in-car communication

KW - microphone arrays

KW - speech separation

UR - http://www.scopus.com/inward/record.url?scp=85217422507&partnerID=8YFLogxK

U2 - 10.1109/SLT61566.2024.10832223

DO - 10.1109/SLT61566.2024.10832223

M3 - Conference contribution

AN - SCOPUS:85217422507

T3 - Proceedings of 2024 IEEE Spoken Language Technology Workshop, SLT 2024

SP - 286

EP - 293

BT - Proceedings of 2024 IEEE Spoken Language Technology Workshop, SLT 2024

PB - Institute of Electrical and Electronics Engineers Inc.

Y2 - 2 December 2024 through 5 December 2024

ER -

Wang Z, Sun J, Zhang Z, Li X, Liu J, Xie L. Dualsep: A Light-Weight Dual-Encoder Convolutional Recurrent Network For Real-Time In-Car Speech Separation. In Proceedings of 2024 IEEE Spoken Language Technology Workshop, SLT 2024. Institute of Electrical and Electronics Engineers Inc. 2024. p. 286-293. (Proceedings of 2024 IEEE Spoken Language Technology Workshop, SLT 2024). doi: 10.1109/SLT61566.2024.10832223
