TY - JOUR
T1 - Club Ideas and Exertions
T2 - Aggregating Local Predictions for Action Recognition
AU - Cao, Congqi
AU - Li, Jiakang
AU - Xi, Runping
AU - Zhang, Yanning
N1 - Publisher Copyright: © 1991-2012 IEEE.
PY - 2021/6
Y1 - 2021/6
N2 - Recognizing the actions performed in a video is challenging for an intelligent system because videos exhibit wide variation and contain an enormous amount of information. Attention mechanisms focus on key target regions, ignore irrelevant information, and extract more discriminative features. In recent years, attention mechanisms have been introduced into video recognition. Although this has spawned a rich literature, most research on attention aims to aggregate local features. Instead of aggregating features, we propose to aggregate decisions based on local spatio-temporal attention regions for action recognition, an approach inspired by ensemble learning. The proposed decision fusion module is easy to interpret and architecture-independent. In this article, the regions around the body joints are regarded as the key regions. We use the regions of the 3-D feature maps that correspond to the body joints as the basic local features for local classification. Finally, all the local classification results are combined to make a global decision. Furthermore, when training the network, we can selectively add supervision to the local and global decisions. We experimentally show that the proposed mechanism improves recognition performance on multiple datasets, which demonstrates its effectiveness.
KW - action recognition
KW - attention
KW - decision aggregation
KW - local decision
UR - http://www.scopus.com/inward/record.url?scp=85107431691&partnerID=8YFLogxK
U2 - 10.1109/TCSVT.2020.3017203
DO - 10.1109/TCSVT.2020.3017203
M3 - Article
AN - SCOPUS:85107431691
SN - 1051-8215
VL - 31
SP - 2247
EP - 2259
JO - IEEE Transactions on Circuits and Systems for Video Technology
JF - IEEE Transactions on Circuits and Systems for Video Technology
IS - 6
M1 - 9169922
ER -