TY - CONF
T1 - Dense Multimodal Fusion for Hierarchically Joint Representation
AU - Hu, Di
AU - Wang, Chengze
AU - Nie, Feiping
AU - Li, Xuelong
N1 - Publisher Copyright:
© 2019 IEEE.
PY - 2019/5
Y1 - 2019/5
AB - Multiple modalities can provide more valuable information than a single one by describing the same content in various ways. Previous methods mainly focus on fusing either the shallow features or the high-level representations generated by unimodal deep networks, which captures only part of the hierarchical correlations across modalities. In this paper, we propose to densely integrate the representations by greedily stacking multiple shared layers between the modality-specific networks, a scheme we name Dense Multimodal Fusion (DMF). The joint representations in the different shared layers capture correlations at different levels, and the connections between shared layers provide an efficient way to learn the dependencies among these hierarchical correlations. Together, these two properties yield multiple learning paths in DMF, resulting in faster convergence, lower training loss, and better performance. We evaluate our model on audiovisual speech recognition and cross-modal retrieval. The noticeable performance gains demonstrate that our model learns more effective joint representations.
KW - Dense Fusion
KW - Hierarchical Correlation
KW - Multimodal Learning
UR - http://www.scopus.com/inward/record.url?scp=85069431017&partnerID=8YFLogxK
DO - 10.1109/ICASSP.2019.8683898
M3 - Conference contribution
AN - SCOPUS:85069431017
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 3941
EP - 3945
BT - 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 44th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019
Y2 - 12 May 2019 through 17 May 2019
ER -