Dense Multimodal Fusion for Hierarchically Joint Representation

Di Hu, Chengze Wang, Feiping Nie, Xuelong Li

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

38 Scopus citations

Abstract

Multiple modalities can provide more valuable information than single one by describing the same contents in various ways. Previous methods mainly focus on fusing the shallow features or high-level representations generated by unimodal deep networks, which only capture part of the hierarchical correlations across modalities. In this paper, we propose to densely integrate the representations by greedily stacking multiple shared layers between different modality-specific networks, which is named as Dense Multimodal Fusion (DMF). The joint representations in different shared layers can capture the correlations in different levels, and the connection between shared layers also provides an efficient way to learn the dependence among hierarchical correlations. These two properties jointly contribute to the multiple learning paths in DMF, which results in faster convergence, lower training loss, and better performance. We evaluate our model on audiovisual speech recognition and cross-modal retrieval. The noticeable performance demonstrates that our model can learn more effective joint representation.

Original languageEnglish
Title of host publication2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings
PublisherInstitute of Electrical and Electronics Engineers Inc.
Pages3941-3945
Number of pages5
ISBN (Electronic)9781479981311
DOIs
StatePublished - May 2019
Event44th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Brighton, United Kingdom
Duration: 12 May 201917 May 2019

Publication series

NameICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Volume2019-May
ISSN (Print)1520-6149

Conference

Conference44th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019
Country/TerritoryUnited Kingdom
CityBrighton
Period12/05/1917/05/19

Keywords

  • Dense Fusion
  • Hierarchical Correlation
  • Multimodal Learning

Fingerprint

Dive into the research topics of 'Dense Multimodal Fusion for Hierarchically Joint Representation'. Together they form a unique fingerprint.

Cite this