RGB-D Scene Recognition via Spatial-Related Multi-Modal Feature Learning

Zhitong Xiong; Yuan Yuan; Qi Wang

doi:10.1109/ACCESS.2019.2932080

Institute of Optoelectronics and Intelligence

Northwestern Polytechnical University Xian

Research output: Contribution to journal › Article › Peer-reviewed

18 Citations (Scopus)

Abstract

RGB-D image-based scene recognition has achieved significant performance improvements with the development of deep learning methods. While convolutional neural networks (CNNs) can learn high-semantic-level features for object recognition, these methods still have limitations for RGB-D scene classification. One limitation is that learning better multi-modal features for RGB-D scene recognition remains an open problem. Another is that scene images are usually not object-centric and exhibit great spatial variability. Thus, vanilla full-image CNN features may not be optimal for scene recognition. Considering these problems, in this paper we propose a compact and effective framework for RGB-D scene recognition. Specifically, we make the following contributions: 1) a novel RGB-D scene recognition framework is proposed to explicitly learn global modal-specific and local modal-consistent features simultaneously. Unlike existing approaches, local CNN features are considered for the learning of modal-consistent representations; 2) a Key Feature Selection (KFS) module is designed, which can adaptively select important local features from the high-semantic-level CNN feature maps. It is more efficient and effective than methods based on object detection and dense patch sampling; and 3) a triplet correlation loss and a spatial-attention similarity loss are proposed for training the KFS module. Under the supervision of the proposed loss functions, the network can learn important local features of the two modalities without extra annotations. Finally, by concatenating the global and local features, the proposed framework achieves new state-of-the-art scene recognition performance on the SUN RGB-D and NYU Depth version 2 (NYUD v2) datasets.

Original language	English
Article number	8782114
Pages (from-to)	106739-106747
Number of pages	9
Journal	IEEE Access
Volume	7
DOI	https://doi.org/10.1109/ACCESS.2019.2932080
Publication status	Published - 2019

Cite this

@article{7d4d4a33df7649709acbae18a031be1e,

title = "RGB-D Scene Recognition via Spatial-Related Multi-Modal Feature Learning",

abstract = "RGB-D image-based scene recognition has achieved significant performance improvements with the development of deep learning methods. While convolutional neural networks (CNNs) can learn high-semantic-level features for object recognition, these methods still have limitations for RGB-D scene classification. One limitation is that learning better multi-modal features for RGB-D scene recognition remains an open problem. Another is that scene images are usually not object-centric and exhibit great spatial variability. Thus, vanilla full-image CNN features may not be optimal for scene recognition. Considering these problems, in this paper we propose a compact and effective framework for RGB-D scene recognition. Specifically, we make the following contributions: 1) a novel RGB-D scene recognition framework is proposed to explicitly learn global modal-specific and local modal-consistent features simultaneously. Unlike existing approaches, local CNN features are considered for the learning of modal-consistent representations; 2) a Key Feature Selection (KFS) module is designed, which can adaptively select important local features from the high-semantic-level CNN feature maps. It is more efficient and effective than methods based on object detection and dense patch sampling; and 3) a triplet correlation loss and a spatial-attention similarity loss are proposed for training the KFS module. Under the supervision of the proposed loss functions, the network can learn important local features of the two modalities without extra annotations. Finally, by concatenating the global and local features, the proposed framework achieves new state-of-the-art scene recognition performance on the SUN RGB-D and NYU Depth version 2 (NYUD v2) datasets.",

keywords = "global and local features, multi-modal feature learning, RGB-D, scene recognition",

author = "Zhitong Xiong and Yuan Yuan and Qi Wang",

note = "Publisher Copyright: {\textcopyright} 2013 IEEE.",

year = "2019",

doi = "10.1109/ACCESS.2019.2932080",

language = "English",

volume = "7",

pages = "106739--106747",

journal = "IEEE Access",

issn = "2169-3536",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - JOUR

T1 - RGB-D Scene Recognition via Spatial-Related Multi-Modal Feature Learning

AU - Xiong, Zhitong

AU - Yuan, Yuan

AU - Wang, Qi

PY - 2019

Y1 - 2019

AB - RGB-D image-based scene recognition has achieved significant performance improvements with the development of deep learning methods. While convolutional neural networks (CNNs) can learn high-semantic-level features for object recognition, these methods still have limitations for RGB-D scene classification. One limitation is that learning better multi-modal features for RGB-D scene recognition remains an open problem. Another is that scene images are usually not object-centric and exhibit great spatial variability. Thus, vanilla full-image CNN features may not be optimal for scene recognition. Considering these problems, in this paper we propose a compact and effective framework for RGB-D scene recognition. Specifically, we make the following contributions: 1) a novel RGB-D scene recognition framework is proposed to explicitly learn global modal-specific and local modal-consistent features simultaneously. Unlike existing approaches, local CNN features are considered for the learning of modal-consistent representations; 2) a Key Feature Selection (KFS) module is designed, which can adaptively select important local features from the high-semantic-level CNN feature maps. It is more efficient and effective than methods based on object detection and dense patch sampling; and 3) a triplet correlation loss and a spatial-attention similarity loss are proposed for training the KFS module. Under the supervision of the proposed loss functions, the network can learn important local features of the two modalities without extra annotations. Finally, by concatenating the global and local features, the proposed framework achieves new state-of-the-art scene recognition performance on the SUN RGB-D and NYU Depth version 2 (NYUD v2) datasets.

KW - global and local features

KW - multi-modal feature learning

KW - RGB-D

KW - scene recognition

UR - http://www.scopus.com/inward/record.url?scp=85071110075&partnerID=8YFLogxK

U2 - 10.1109/ACCESS.2019.2932080

DO - 10.1109/ACCESS.2019.2932080

M3 - Article

AN - SCOPUS:85071110075

SN - 2169-3536

VL - 7

SP - 106739

EP - 106747

JO - IEEE Access

JF - IEEE Access

M1 - 8782114

ER -
