Deep multimodal clustering for unsupervised audiovisual learning

Di Hu, Feiping Nie, Xuelong Li

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-reviewed

181 citations (Scopus)

Abstract

Seen birds twitter and running cars are accompanied by noise: such natural audiovisual correspondences offer a way to explore and understand the outside world. However, mixtures of multiple objects and sounds make efficient matching intractable in unconstrained environments. To address this problem, we propose to fully exploit the audio and visual components and perform elaborate correspondence learning between them. Concretely, we propose a novel unsupervised audiovisual learning model, named Deep Multimodal Clustering (DMC), that synchronously performs sets of clustering with multimodal vectors of convolutional maps in different shared spaces to capture multiple audiovisual correspondences. This integrated multimodal clustering network can be trained effectively with a max-margin loss in an end-to-end fashion. Extensive experiments on feature evaluation and audiovisual tasks show that DMC learns effective unimodal representations, with which a classifier can even outperform human performance. Further, DMC shows notable performance in sound localization, multisource detection, and audiovisual understanding.
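The max-margin training objective mentioned in the abstract can be illustrated with a small sketch. The code below shows a generic max-margin ranking loss over paired audio/visual embeddings, not the authors' exact DMC formulation: the use of cosine similarity, the margin value, and the batch-wise negatives are all assumptions for illustration.

```python
import numpy as np

def max_margin_loss(audio, visual, margin=1.0):
    """Illustrative max-margin ranking loss for a batch of paired
    audio/visual embeddings (row i of `audio` matches row i of `visual`).

    For each pair i, the matched-pair similarity should exceed every
    mismatched similarity (j != i) by at least `margin`; violations
    incur a hinge penalty.
    """
    # Cosine similarity matrix: sim[i, j] = <a_i, v_j> on unit vectors.
    a = audio / np.linalg.norm(audio, axis=1, keepdims=True)
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    sim = a @ v.T
    pos = np.diag(sim)  # matched-pair similarities

    # Hinge terms for both retrieval directions (audio->visual, visual->audio).
    cost_av = np.maximum(0.0, margin + sim - pos[:, None])
    cost_va = np.maximum(0.0, margin + sim - pos[None, :])

    # Average over the off-diagonal (mismatched) entries only.
    n = sim.shape[0]
    off_diag = ~np.eye(n, dtype=bool)
    return (cost_av[off_diag].sum() + cost_va[off_diag].sum()) / (2 * n * (n - 1))
```

With perfectly separated embeddings (matched similarity 1, mismatched 0) the hinge terms vanish and the loss is zero; if all embeddings collapse to a single direction, every mismatched pair violates the margin and the loss is positive, which is what drives the network to discriminate correspondences.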

Original language: English
Host publication: Proceedings - 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2019
Publisher: IEEE Computer Society
Pages: 9240-9249
Number of pages: 10
ISBN (electronic): 9781728132938
DOI
Publication status: Published - June 2019
Event: 32nd IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2019 - Long Beach, United States
Duration: 16 June 2019 - 20 June 2019

Publication series

Name: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Volume: 2019-June
ISSN (print): 1063-6919

Conference

Conference: 32nd IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2019
Country/Territory: United States
City: Long Beach
Period: 16/06/19 - 20/06/19

