3D skeleton based action recognition by video-domain translation-scale invariant mapping and multi-scale dilated CNN

Bo Li; Mingyi He; Yuchao Dai; Xuelian Cheng; Yucheng Chen

doi:10.1007/s11042-018-5642-0

School of Electronics and Information

Northwestern Polytechnical University, Xi'an

Research output: Contribution to journal › Article › peer-review

38 Citations (Scopus)

Abstract

In this paper, we present an image classification approach to action recognition with 3D skeleton videos. First, we propose a video domain translation-scale invariant image mapping, which transforms the 3D skeleton videos to color images, namely skeleton images. Second, a multi-scale dilated convolutional neural network (CNN) is designed for the classification of the skeleton images. Our multi-scale dilated CNN model could effectively improve the frequency adaptiveness and exploit the discriminative temporal-spatial cues for the skeleton images. Even though the skeleton images are very different from natural images, we show that the fine-tuning strategy still works well. Furthermore, we propose different kinds of data augmentation strategies to improve the generalization and robustness of our method. Experimental results on popular benchmark datasets such as NTU RGB + D, UTD-MHAD, MSRC-12 and G3D demonstrate the superiority of our approach, which outperforms the state-of-the-art methods by a large margin.
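The translation-scale invariant mapping named in the abstract can be illustrated with a minimal Python sketch. This record does not give the paper's formula, so the details here are assumptions rather than the authors' method: the function name `skeleton_video_to_image` is hypothetical, rows are taken to be joints and columns frames, the (x, y, z) coordinates fill the three color channels, and normalization uses per-axis minima and one shared scale computed over the whole video.

```python
from typing import List, Sequence, Tuple

Joint = Tuple[float, float, float]   # (x, y, z) position of one joint
Frame = Sequence[Joint]              # all joints in one frame
Pixel = Tuple[int, int, int]         # (R, G, B) values in 0..255


def skeleton_video_to_image(video: Sequence[Frame]) -> List[List[Pixel]]:
    """Map a skeleton video to a color image (hypothetical sketch).

    Rows index joints, columns index frames, and the three channels hold
    the normalized x, y and z coordinates.
    """
    if not video or not video[0]:
        raise ValueError("video needs at least one frame with one joint")

    # Assumption: statistics are taken over the whole video, not per frame.
    # Subtracting the per-axis minimum removes any global translation.
    mins = [min(joint[a] for frame in video for joint in frame) for a in range(3)]
    maxs = [max(joint[a] for frame in video for joint in frame) for a in range(3)]

    # Assumption: one shared scale for all axes, so a uniform rescaling of
    # the skeleton cancels out while the body's proportions are preserved.
    scale = max(maxs[a] - mins[a] for a in range(3))
    if scale == 0:
        scale = 1.0  # degenerate video where every joint sits at one point

    num_joints = len(video[0])
    image: List[List[Pixel]] = []
    for j in range(num_joints):
        row: List[Pixel] = []
        for frame in video:
            pixel = tuple(
                int(round(255 * (frame[j][a] - mins[a]) / scale))
                for a in range(3)
            )
            row.append(pixel)
        image.append(row)
    return image
```

Under these assumptions, shifting every joint by a constant offset or multiplying all coordinates by a positive constant yields the same image, which is the property the phrase "translation-scale invariant" refers to. The resulting image could then be passed to an image classifier such as the CNN the abstract describes.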

Original language	English
Pages (from-to)	22901-22921
Number of pages	21
Journal	Multimedia Tools and Applications
Volume	77
Issue number	17
DOI	https://doi.org/10.1007/s11042-018-5642-0
Publication status	Published - 1 Sep 2018

Cite this

@article{fe6a8f935f8248438863001bf3e8a772,

title = "3D skeleton based action recognition by video-domain translation-scale invariant mapping and multi-scale dilated CNN",

abstract = "In this paper, we present an image classification approach to action recognition with 3D skeleton videos. First, we propose a video domain translation-scale invariant image mapping, which transforms the 3D skeleton videos to color images, namely skeleton images. Second, a multi-scale dilated convolutional neural network (CNN) is designed for the classification of the skeleton images. Our multi-scale dilated CNN model could effectively improve the frequency adaptiveness and exploit the discriminative temporal-spatial cues for the skeleton images. Even though the skeleton images are very different from natural images, we show that the fine-tuning strategy still works well. Furthermore, we propose different kinds of data augmentation strategies to improve the generalization and robustness of our method. Experimental results on popular benchmark datasets such as NTU RGB + D, UTD-MHAD, MSRC-12 and G3D demonstrate the superiority of our approach, which outperforms the state-of-the-art methods by a large margin.",

keywords = "3D skeleton, CNN, Image mapping, Recognition",

author = "Bo Li and Mingyi He and Yuchao Dai and Xuelian Cheng and Yucheng Chen",

note = "Publisher Copyright: {\textcopyright} 2018, Springer Science+Business Media, LLC, part of Springer Nature.",

year = "2018",

month = sep,

day = "1",

doi = "10.1007/s11042-018-5642-0",

language = "English",

volume = "77",

pages = "22901--22921",

journal = "Multimedia Tools and Applications",

issn = "1380-7501",

publisher = "Springer",

number = "17",

}

TY - JOUR

T1 - 3D skeleton based action recognition by video-domain translation-scale invariant mapping and multi-scale dilated CNN

AU - Li, Bo

AU - He, Mingyi

AU - Dai, Yuchao

AU - Cheng, Xuelian

AU - Chen, Yucheng

PY - 2018/9/1

Y1 - 2018/9/1

N2 - In this paper, we present an image classification approach to action recognition with 3D skeleton videos. First, we propose a video domain translation-scale invariant image mapping, which transforms the 3D skeleton videos to color images, namely skeleton images. Second, a multi-scale dilated convolutional neural network (CNN) is designed for the classification of the skeleton images. Our multi-scale dilated CNN model could effectively improve the frequency adaptiveness and exploit the discriminative temporal-spatial cues for the skeleton images. Even though the skeleton images are very different from natural images, we show that the fine-tuning strategy still works well. Furthermore, we propose different kinds of data augmentation strategies to improve the generalization and robustness of our method. Experimental results on popular benchmark datasets such as NTU RGB + D, UTD-MHAD, MSRC-12 and G3D demonstrate the superiority of our approach, which outperforms the state-of-the-art methods by a large margin.

AB - In this paper, we present an image classification approach to action recognition with 3D skeleton videos. First, we propose a video domain translation-scale invariant image mapping, which transforms the 3D skeleton videos to color images, namely skeleton images. Second, a multi-scale dilated convolutional neural network (CNN) is designed for the classification of the skeleton images. Our multi-scale dilated CNN model could effectively improve the frequency adaptiveness and exploit the discriminative temporal-spatial cues for the skeleton images. Even though the skeleton images are very different from natural images, we show that the fine-tuning strategy still works well. Furthermore, we propose different kinds of data augmentation strategies to improve the generalization and robustness of our method. Experimental results on popular benchmark datasets such as NTU RGB + D, UTD-MHAD, MSRC-12 and G3D demonstrate the superiority of our approach, which outperforms the state-of-the-art methods by a large margin.

KW - 3D skeleton

KW - CNN

KW - Image mapping

KW - Recognition

UR - http://www.scopus.com/inward/record.url?scp=85041558131&partnerID=8YFLogxK

U2 - 10.1007/s11042-018-5642-0

DO - 10.1007/s11042-018-5642-0

M3 - Article

AN - SCOPUS:85041558131

SN - 1380-7501

VL - 77

SP - 22901

EP - 22921

JO - Multimedia Tools and Applications

JF - Multimedia Tools and Applications

IS - 17

ER -
