MSN: Modality separation networks for RGB-D scene recognition

Zhitong Xiong, Yuan Yuan, Qi Wang

Research output: Contribution to journal › Article › peer-review

24 Scopus citations

Abstract

RGB-D image based indoor scene recognition is a challenging task due to complex scene layouts and cluttered objects. Although the depth modality provides extra geometric information, how to better learn multi-modal features remains an open problem. Considering this, in this paper we propose modality separation networks to extract modal-consistent and modal-specific features simultaneously. The motivation for this work is twofold: 1) to explicitly learn what is unique to each modality and what is common between the two modalities; 2) to explore the relationship between global/local features and modal-specific/consistent features. To this end, the proposed framework contains two branches of submodules for learning the multi-modal features. One branch extracts the individual characteristics of each modality by minimizing the similarity between the two modalities, while the other learns the information shared by the two modalities by maximizing a correlation term. Moreover, with a spatial attention module, our method can visualize the spatial positions on which the different submodules focus. We evaluate our method on two public RGB-D scene recognition datasets, and the proposed framework achieves new state-of-the-art results.
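As a rough illustration of the two-branch objective described in the abstract, the PyTorch sketch below pairs modal-specific encoders with a loss that penalizes cross-modal similarity and modal-consistent encoders with a term that rewards cross-modal agreement. All module names, feature dimensions, and the exact loss forms (a squared cross-correlation penalty and a cosine-similarity term) are assumptions made for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalitySeparationSketch(nn.Module):
    """Minimal two-branch feature learner (hypothetical names and shapes)."""
    def __init__(self, in_dim=512, feat_dim=128):
        super().__init__()
        # Modal-specific encoders, one per modality (RGB and depth)
        self.spec_rgb = nn.Linear(in_dim, feat_dim)
        self.spec_depth = nn.Linear(in_dim, feat_dim)
        # Modal-consistent encoders, one per modality
        self.cons_rgb = nn.Linear(in_dim, feat_dim)
        self.cons_depth = nn.Linear(in_dim, feat_dim)

    def forward(self, rgb_feat, depth_feat):
        s_r = self.spec_rgb(rgb_feat)      # RGB-specific features
        s_d = self.spec_depth(depth_feat)  # depth-specific features
        c_r = self.cons_rgb(rgb_feat)      # RGB view of the shared features
        c_d = self.cons_depth(depth_feat)  # depth view of the shared features
        return s_r, s_d, c_r, c_d

def separation_losses(s_r, s_d, c_r, c_d):
    # Minimize similarity between modal-specific features: center and
    # column-normalize each batch, then penalize the squared entries of
    # their cross-correlation matrix (one assumed form of the penalty).
    s_r_n = F.normalize(s_r - s_r.mean(0), dim=0)
    s_d_n = F.normalize(s_d - s_d.mean(0), dim=0)
    loss_spec = (s_r_n.t() @ s_d_n).pow(2).mean()
    # Maximize correlation between modal-consistent features, approximated
    # here by minimizing their per-sample cosine distance.
    loss_cons = 1.0 - F.cosine_similarity(c_r, c_d, dim=1).mean()
    return loss_spec, loss_cons

# Hypothetical usage with 512-d backbone features for a batch of 8 scenes
rgb = torch.randn(8, 512)
depth = torch.randn(8, 512)
model = ModalitySeparationSketch()
loss_spec, loss_cons = separation_losses(*model(rgb, depth))
total = loss_spec + loss_cons  # relative weighting of the terms is a design choice
```

In a full system these losses would be combined with the scene classification loss, and the linear encoders would be replaced by the CNN submodules with spatial attention that the paper describes.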

Original language: English
Pages (from-to): 81-89
Number of pages: 9
Journal: Neurocomputing
Volume: 373
DOIs
State: Published - 15 Jan 2020

Keywords

  • Deep learning
  • Multi-modal feature learning
  • RGB-D
  • Scene recognition

