TY - CONF
T1 - Dense Multimodal Fusion for Hierarchically Joint Representation
AU - Hu, Di
AU - Wang, Chengze
AU - Nie, Feiping
AU - Li, Xuelong
N1 - Publisher Copyright:
© 2019 IEEE.
PY - 2019/5
Y1 - 2019/5
AB - Multiple modalities can provide more valuable information than a single one by describing the same content in various ways. Previous methods mainly focus on fusing either the shallow features or the high-level representations generated by unimodal deep networks, which captures only part of the hierarchical correlations across modalities. In this paper, we propose to densely integrate the representations by greedily stacking multiple shared layers between the modality-specific networks, a scheme we name Dense Multimodal Fusion (DMF). The joint representations in the different shared layers capture correlations at different levels, and the connections between shared layers provide an efficient way to learn the dependencies among these hierarchical correlations. Together, these two properties yield multiple learning paths in DMF, resulting in faster convergence, lower training loss, and better performance. We evaluate our model on audiovisual speech recognition and cross-modal retrieval. The noticeable performance gains demonstrate that our model learns more effective joint representations.
KW - Dense Fusion
KW - Hierarchical Correlation
KW - Multimodal Learning
UR - http://www.scopus.com/inward/record.url?scp=85069431017&partnerID=8YFLogxK
DO - 10.1109/ICASSP.2019.8683898
M3 - Conference contribution
AN - SCOPUS:85069431017
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 3941
EP - 3945
BT - 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 44th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019
Y2 - 12 May 2019 through 17 May 2019
ER -