3D skeleton based action recognition by video-domain translation-scale invariant mapping and multi-scale dilated CNN

Bo Li; Mingyi He; Yuchao Dai; Xuelian Cheng; Yucheng Chen

doi:10.1007/s11042-018-5642-0

School of Electronics and Information, Northwestern Polytechnical University, Xi'an

Research output: Contribution to journal › Article › peer-review

38 Scopus citations

Abstract

In this paper, we present an image classification approach to action recognition with 3D skeleton videos. First, we propose a video domain translation-scale invariant image mapping, which transforms the 3D skeleton videos to color images, namely skeleton images. Second, a multi-scale dilated convolutional neural network (CNN) is designed for the classification of the skeleton images. Our multi-scale dilated CNN model could effectively improve the frequency adaptiveness and exploit the discriminative temporal-spatial cues for the skeleton images. Even though the skeleton images are very different from natural images, we show that the fine-tuning strategy still works well. Furthermore, we propose different kinds of data augmentation strategies to improve the generalization and robustness of our method. Experimental results on popular benchmark datasets such as NTU RGB + D, UTD-MHAD, MSRC-12 and G3D demonstrate the superiority of our approach, which outperforms the state-of-the-art methods by a large margin.
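As a rough illustration of the first step, the sketch below maps a skeleton sequence to a color image using a min-max normalization computed once over the whole video, so that translating or uniformly scaling the skeleton leaves the image unchanged. The array layout (frames x joints x 3), the function name `skeleton_to_image`, the channel and axis assignment, and the single shared scale are all assumptions made for illustration; the abstract does not specify the exact mapping.

```python
import numpy as np


def skeleton_to_image(seq):
    """Map a (frames, joints, 3) skeleton sequence to a uint8 RGB image.

    Hypothetical helper, not the authors' code: x/y/z become the R/G/B
    channels, joints index the rows and frames index the columns.
    """
    seq = np.asarray(seq, dtype=np.float64)
    # Per-axis minimum over the whole video removes any global translation.
    c_min = seq.min(axis=(0, 1), keepdims=True)
    # A single range shared by all three axes removes global scale while
    # keeping the relative proportions between x, y and z.
    span = (seq.max(axis=(0, 1), keepdims=True) - c_min).max()
    if span == 0:
        span = 1.0  # degenerate case: all joints coincide in every frame
    img = np.floor(255.0 * (seq - c_min) / span).astype(np.uint8)
    # (frames, joints, 3) -> (joints, frames, 3)
    return img.transpose(1, 0, 2)
```

Because the statistics are taken over the video rather than per frame, motion between frames is preserved in the image, whereas a per-frame normalization would erase it.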

Original language	English
Pages (from-to)	22901-22921
Number of pages	21
Journal	Multimedia Tools and Applications
Volume	77
Issue number	17
DOIs	https://doi.org/10.1007/s11042-018-5642-0
State	Published - 1 Sep 2018

Keywords

3D skeleton
CNN
Image mapping
Recognition

Cite this

@article{fe6a8f935f8248438863001bf3e8a772,

title = "3D skeleton based action recognition by video-domain translation-scale invariant mapping and multi-scale dilated CNN",

abstract = "In this paper, we present an image classification approach to action recognition with 3D skeleton videos. First, we propose a video domain translation-scale invariant image mapping, which transforms the 3D skeleton videos to color images, namely skeleton images. Second, a multi-scale dilated convolutional neural network (CNN) is designed for the classification of the skeleton images. Our multi-scale dilated CNN model could effectively improve the frequency adaptiveness and exploit the discriminative temporal-spatial cues for the skeleton images. Even though the skeleton images are very different from natural images, we show that the fine-tuning strategy still works well. Furthermore, we propose different kinds of data augmentation strategies to improve the generalization and robustness of our method. Experimental results on popular benchmark datasets such as NTU RGB + D, UTD-MHAD, MSRC-12 and G3D demonstrate the superiority of our approach, which outperforms the state-of-the-art methods by a large margin.",

keywords = "3D skeleton, CNN, Image mapping, Recognition",

author = "Bo Li and Mingyi He and Yuchao Dai and Xuelian Cheng and Yucheng Chen",

note = "Publisher Copyright: {\textcopyright} 2018, Springer Science+Business Media, LLC, part of Springer Nature.",

year = "2018",

month = sep,

day = "1",

doi = "10.1007/s11042-018-5642-0",

language = "English",

volume = "77",

pages = "22901--22921",

journal = "Multimedia Tools and Applications",

issn = "1380-7501",

publisher = "Springer",

number = "17",

}

TY - JOUR

T1 - 3D skeleton based action recognition by video-domain translation-scale invariant mapping and multi-scale dilated CNN

AU - Li, Bo

AU - He, Mingyi

AU - Dai, Yuchao

AU - Cheng, Xuelian

AU - Chen, Yucheng

PY - 2018/9/1

Y1 - 2018/9/1

AB - In this paper, we present an image classification approach to action recognition with 3D skeleton videos. First, we propose a video domain translation-scale invariant image mapping, which transforms the 3D skeleton videos to color images, namely skeleton images. Second, a multi-scale dilated convolutional neural network (CNN) is designed for the classification of the skeleton images. Our multi-scale dilated CNN model could effectively improve the frequency adaptiveness and exploit the discriminative temporal-spatial cues for the skeleton images. Even though the skeleton images are very different from natural images, we show that the fine-tuning strategy still works well. Furthermore, we propose different kinds of data augmentation strategies to improve the generalization and robustness of our method. Experimental results on popular benchmark datasets such as NTU RGB + D, UTD-MHAD, MSRC-12 and G3D demonstrate the superiority of our approach, which outperforms the state-of-the-art methods by a large margin.

KW - 3D skeleton

KW - CNN

KW - Image mapping

KW - Recognition

UR - http://www.scopus.com/inward/record.url?scp=85041558131&partnerID=8YFLogxK

U2 - 10.1007/s11042-018-5642-0

DO - 10.1007/s11042-018-5642-0

M3 - Article

AN - SCOPUS:85041558131

SN - 1380-7501

VL - 77

SP - 22901

EP - 22921

JO - Multimedia Tools and Applications

JF - Multimedia Tools and Applications

IS - 17

ER -
