Going deeper with two-stream ConvNets for action recognition in video surveillance

Yamin Han; Peng Zhang; Tao Zhuo; Wei Huang; Yanning Zhang

doi:10.1016/j.patrec.2017.08.015

School of Computer Science

Research output: Contribution to journal › Article › peer-review

74 Scopus citations

Abstract

Learning with deep convolutional networks has shown outstanding effectiveness in a variety of vision-based classification tasks, for which large datasets are a prerequisite for high performance. In many realistic settings, however, such as human action recognition in videos, a massive quantity of training samples is not always available. The resulting data deficiency, especially of labeled data, critically limits the use of deeper model structures as a promising solution because of their high risk of overfitting. Additionally, when modeling capacity is constrained by model depth, high-level visual cues that accompany human action, such as object interaction, scene context and pose variation, pose both extrinsic and intrinsic challenges for traditional deep convolutional networks. To address these limitations, this paper proposes a dataset remodeling strategy: parameters of ResNet-101 layers trained on the ImageNet dataset are transferred to initialize the learning model, and an augmented data variation approach is adopted to overcome the overfitting caused by sample deficiency. To improve the model structure, a novel, deeper two-stream ConvNet architecture is designed to learn action complexity. With a dis-order strategy for the training/testing video sets, the proposed model and learning strategy together achieve a significant improvement in action recognition. Experiments on two challenging datasets, UCF101 and KTH, verify superior performance compared with other state-of-the-art methods.

Original language	English
Pages (from-to)	83-90
Number of pages	8
Journal	Pattern Recognition Letters
Volume	107
DOIs	https://doi.org/10.1016/j.patrec.2017.08.015
State	Published - 1 May 2018

Keywords

Action recognition
ConvNets
Deeper
Two-stream
Video surveillance

Cite this

@article{4c16e8720bf4429abea77e682fb428fb,

title = "Going deeper with two-stream ConvNets for action recognition in video surveillance",

abstract = "Learning with deep convolutional networks has shown outstanding effectiveness in a variety of vision-based classification tasks, for which large datasets are a prerequisite for high performance. In many realistic settings, however, such as human action recognition in videos, a massive quantity of training samples is not always available. The resulting data deficiency, especially of labeled data, critically limits the use of deeper model structures as a promising solution because of their high risk of overfitting. Additionally, when modeling capacity is constrained by model depth, high-level visual cues that accompany human action, such as object interaction, scene context and pose variation, pose both extrinsic and intrinsic challenges for traditional deep convolutional networks. To address these limitations, this paper proposes a dataset remodeling strategy: parameters of ResNet-101 layers trained on the ImageNet dataset are transferred to initialize the learning model, and an augmented data variation approach is adopted to overcome the overfitting caused by sample deficiency. To improve the model structure, a novel, deeper two-stream ConvNet architecture is designed to learn action complexity. With a dis-order strategy for the training/testing video sets, the proposed model and learning strategy together achieve a significant improvement in action recognition. Experiments on two challenging datasets, UCF101 and KTH, verify superior performance compared with other state-of-the-art methods.",

keywords = "Action recognition, ConvNets, Deeper, Two-stream, Video surveillance",

author = "Yamin Han and Peng Zhang and Tao Zhuo and Wei Huang and Yanning Zhang",

note = "Publisher Copyright: {\textcopyright} 2017",

year = "2018",

month = may,

day = "1",

doi = "10.1016/j.patrec.2017.08.015",

language = "English",

volume = "107",

pages = "83--90",

journal = "Pattern Recognition Letters",

issn = "0167-8655",

publisher = "Elsevier B.V.",

}

TY - JOUR

T1 - Going deeper with two-stream ConvNets for action recognition in video surveillance

AU - Han, Yamin

AU - Zhang, Peng

AU - Zhuo, Tao

AU - Huang, Wei

AU - Zhang, Yanning

PY - 2018/5/1

Y1 - 2018/5/1

AB - Learning with deep convolutional networks has shown outstanding effectiveness in a variety of vision-based classification tasks, for which large datasets are a prerequisite for high performance. In many realistic settings, however, such as human action recognition in videos, a massive quantity of training samples is not always available. The resulting data deficiency, especially of labeled data, critically limits the use of deeper model structures as a promising solution because of their high risk of overfitting. Additionally, when modeling capacity is constrained by model depth, high-level visual cues that accompany human action, such as object interaction, scene context and pose variation, pose both extrinsic and intrinsic challenges for traditional deep convolutional networks. To address these limitations, this paper proposes a dataset remodeling strategy: parameters of ResNet-101 layers trained on the ImageNet dataset are transferred to initialize the learning model, and an augmented data variation approach is adopted to overcome the overfitting caused by sample deficiency. To improve the model structure, a novel, deeper two-stream ConvNet architecture is designed to learn action complexity. With a dis-order strategy for the training/testing video sets, the proposed model and learning strategy together achieve a significant improvement in action recognition. Experiments on two challenging datasets, UCF101 and KTH, verify superior performance compared with other state-of-the-art methods.

KW - Action recognition

KW - ConvNets

KW - Deeper

KW - Two-stream

KW - Video surveillance

UR - http://www.scopus.com/inward/record.url?scp=85032492559&partnerID=8YFLogxK

U2 - 10.1016/j.patrec.2017.08.015

DO - 10.1016/j.patrec.2017.08.015

M3 - Article

AN - SCOPUS:85032492559

SN - 0167-8655

VL - 107

SP - 83

EP - 90

JO - Pattern Recognition Letters

JF - Pattern Recognition Letters

ER -
