MSN: Modality separation networks for RGB-D scene recognition

Zhitong Xiong, Yuan Yuan, Qi Wang

Research output: Contribution to journal › Article › peer-review

24 Scopus citations

Abstract

RGB-D image based indoor scene recognition is a challenging task due to complex scene layouts and cluttered objects. Although the depth modality provides extra geometric information, how to better learn multi-modal features remains an open problem. Considering this, in this paper we propose modality separation networks to extract modal-consistent and modal-specific features simultaneously. The motivation for this work is twofold: 1) to explicitly learn what is unique to each modality and what is common between the two modalities; 2) to explore the relationship between global/local features and modal-specific/consistent features. To this end, the proposed framework contains two branches of submodules for learning the multi-modal features. One branch extracts the individual characteristics of each modality by minimizing the similarity between the two modalities, while the other learns the information shared by the two modalities by maximizing a correlation term. Moreover, with a spatial attention module, our method can visualize the spatial positions on which the different submodules focus. We evaluate our method on two public RGB-D scene recognition datasets, and the proposed framework achieves new state-of-the-art results.
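As a rough illustration of the two-branch objective described in the abstract, the PyTorch sketch below pairs modal-specific encoders with a loss that penalizes cross-modal similarity and modal-consistent encoders with a term that rewards cross-modal agreement. All module names, feature dimensions, and the exact loss forms (a squared cross-correlation penalty and a cosine-similarity term) are assumptions made for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalitySeparationSketch(nn.Module):
    """Minimal two-branch feature learner (hypothetical names and shapes)."""
    def __init__(self, in_dim=512, feat_dim=128):
        super().__init__()
        # Modal-specific encoders, one per modality (RGB and depth)
        self.spec_rgb = nn.Linear(in_dim, feat_dim)
        self.spec_depth = nn.Linear(in_dim, feat_dim)
        # Modal-consistent encoders, one per modality
        self.cons_rgb = nn.Linear(in_dim, feat_dim)
        self.cons_depth = nn.Linear(in_dim, feat_dim)

    def forward(self, rgb_feat, depth_feat):
        s_r = self.spec_rgb(rgb_feat)      # RGB-specific features
        s_d = self.spec_depth(depth_feat)  # depth-specific features
        c_r = self.cons_rgb(rgb_feat)      # RGB view of the shared features
        c_d = self.cons_depth(depth_feat)  # depth view of the shared features
        return s_r, s_d, c_r, c_d

def separation_losses(s_r, s_d, c_r, c_d):
    # Minimize similarity between modal-specific features: center and
    # column-normalize each batch, then penalize the squared entries of
    # their cross-correlation matrix (one assumed form of the penalty).
    s_r_n = F.normalize(s_r - s_r.mean(0), dim=0)
    s_d_n = F.normalize(s_d - s_d.mean(0), dim=0)
    loss_spec = (s_r_n.t() @ s_d_n).pow(2).mean()
    # Maximize correlation between modal-consistent features, approximated
    # here by minimizing their per-sample cosine distance.
    loss_cons = 1.0 - F.cosine_similarity(c_r, c_d, dim=1).mean()
    return loss_spec, loss_cons

# Hypothetical usage with 512-d backbone features for a batch of 8 scenes
rgb = torch.randn(8, 512)
depth = torch.randn(8, 512)
model = ModalitySeparationSketch()
loss_spec, loss_cons = separation_losses(*model(rgb, depth))
total = loss_spec + loss_cons  # relative weighting of the terms is a design choice
```

In a full system these losses would be combined with the scene classification loss, and the linear encoders would be replaced by the CNN submodules with spatial attention that the paper describes.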

Original language: English
Pages (from-to): 81-89
Number of pages: 9
Journal: Neurocomputing
Volume: 373
DOIs
State: Published - 15 Jan 2020

Keywords

  • Deep learning
  • Multi-modal feature learning
  • RGB-D
  • Scene recognition

