Look&listen: Multi-Modal Correlation Learning for Active Speaker Detection and Speech Enhancement

Junwen Xiong, Yu Zhou, Peng Zhang, Lei Xie, Wei Huang, Yufei Zha

Research output: Contribution to journal › Article › peer-review

10 Citations (Scopus)

Abstract

Active speaker detection and speech enhancement have become two increasingly attractive topics in audio-visual scenario understanding. Owing to their distinct characteristics, independently designed architectures have been widely adopted for each individual task. This makes the learned representations task-specific and inevitably limits the generalization ability of features built on multi-modal modeling. More recent studies have shown that establishing cross-modal relationships between the auditory and visual streams is a promising solution to the challenge of audio-visual multi-task learning. Therefore, motivated by the goal of bridging multi-modal associations in audio-visual tasks, this study proposes a unified framework that achieves target speaker detection and speech enhancement through joint learning of audio-visual modeling. With the assistance of the audio-visual channels of videos in challenging real-world scenarios, the proposed method exploits the inherent correlations between the audio and visual signals, which are further used to anticipate and model the temporal audio-visual relationships across spatial-temporal space via a cross-modal conformer. In addition, a plug-and-play multi-modal layer normalization is introduced to alleviate the distribution misalignment of multi-modal features. Based on cross-modal circulant fusion, the proposed model is capable of learning all audio-visual representations in a holistic process. Extensive experiments demonstrate that the optimized model learns the correlations between different modalities and the associations among diverse tasks more effectively. In comparison with other state-of-the-art works, the proposed method achieves superior performance for active speaker detection and audio-visual speech enhancement on three benchmark datasets, along with favorable generalization across diverse challenges.
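The abstract names two architectural components, a plug-and-play multi-modal layer normalization and a cross-modal circulant fusion, without giving their formulations. The following is a minimal PyTorch sketch of how such components could plausibly look; the class and function names, the per-modality affine parameters, and the circulant-matrix interaction are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn


class MultiModalLayerNorm(nn.Module):
    """Hypothetical sketch of a plug-and-play multi-modal layer norm:
    standard LayerNorm statistics, but with separate affine parameters
    per modality so audio and visual features are rescaled independently."""

    def __init__(self, dim: int, num_modalities: int = 2, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(num_modalities, dim))
        self.beta = nn.Parameter(torch.zeros(num_modalities, dim))

    def forward(self, x: torch.Tensor, modality: int) -> torch.Tensor:
        # x: (batch, time, dim); modality: 0 = audio, 1 = visual (assumed)
        mu = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        x_hat = (x - mu) / torch.sqrt(var + self.eps)
        return x_hat * self.gamma[modality] + self.beta[modality]


def circulant(v: torch.Tensor) -> torch.Tensor:
    """Build a circulant matrix from the last axis of v:
    row i of the result is v cyclically shifted by i positions."""
    d = v.shape[-1]
    shifts = (torch.arange(d, device=v.device).unsqueeze(1)
              - torch.arange(d, device=v.device).unsqueeze(0)) % d
    return v[..., shifts]  # (..., d, d)


def circulant_fusion(audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
    """One plausible reading of cross-modal circulant fusion: multiply each
    stream by the circulant matrix of the other, so every component of one
    modality interacts with every cyclic shift of the other."""
    d = audio.shape[-1]
    a2v = torch.matmul(circulant(visual), audio.unsqueeze(-1)).squeeze(-1) / d
    v2a = torch.matmul(circulant(audio), visual.unsqueeze(-1)).squeeze(-1) / d
    return a2v + v2a


if __name__ == "__main__":
    norm = MultiModalLayerNorm(dim=128)
    a = norm(torch.randn(4, 50, 128), modality=0)  # audio stream
    v = norm(torch.randn(4, 50, 128), modality=1)  # visual stream
    fused = circulant_fusion(a, v)                 # (4, 50, 128)
    print(fused.shape)
```

Under these assumptions, the shared statistics with per-modality gamma/beta let one normalization layer serve both streams without forcing them onto a single learned scale, and the circulant construction gives dense cross-modal interaction at O(d^2) cost per time step.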

Original language: English
Pages (from-to): 5800-5812
Number of pages: 13
Journal: IEEE Transactions on Multimedia
Volume: 25
DOI
Publication status: Published - 2023
