Club Ideas and Exertions: Aggregating Local Predictions for Action Recognition

Congqi Cao, Jiakang Li, Runping Xi, Yanning Zhang

Research output: Contribution to journal › Article › peer-review

5 Scopus citations

Abstract

Recognizing the actions performed in a video is challenging for an intelligent system because videos contain wide variations and an enormous amount of information. Attention mechanisms focus on key target regions, suppress irrelevant information, and extract more discriminative features, and in recent years they have been introduced into video recognition. Although a rich literature has emerged, most attention-based research aims to aggregate local features. Instead of aggregating features, we propose, inspired by ensemble learning, to aggregate decisions made on local spatio-temporal attention regions for action recognition. The proposed decision fusion module is easy to interpret and architecture-independent. In this article, the regions around the body joints are regarded as the key regions: the regions of the 3-D feature maps corresponding to the body joints serve as the basic local features for local classification, and all the local classification results are then combined to make a global decision. Furthermore, when training the network, we can selectively add supervision to the local and global decisions. Experiments on multiple datasets show that the proposed mechanism improves recognition performance, demonstrating its effectiveness.
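The decision-level fusion described in the abstract can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`joint_region_features`, `aggregate_decisions`, `training_loss`), the square pooling window of radius `radius` around each joint, the averaging over time, the shared linear classifier, and the averaging of per-joint softmax probabilities as the fusion rule are all hypothetical choices made for illustration. Joint coordinates are assumed to be already mapped into feature-map coordinates.

```python
import numpy as np


def softmax(logits: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along `axis`."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def joint_region_features(fmap: np.ndarray, joints: np.ndarray, radius: int = 1) -> np.ndarray:
    """Hypothetical local feature extractor (assumption, not the paper's exact op).

    fmap:   (C, T, H, W) 3-D feature map.
    joints: (T, J, 2) (row, col) joint positions in feature-map coordinates.
    Average-pools a (2r+1)x(2r+1) window around each joint in every frame,
    then averages over time. Returns (J, C), one local feature per joint.
    """
    C, T, H, W = fmap.shape
    J = joints.shape[1]
    feats = np.zeros((J, C))
    for j in range(J):
        acc = np.zeros(C)
        for t in range(T):
            # Clip the joint into the map so the window is never empty.
            r = int(np.clip(joints[t, j, 0], 0, H - 1))
            c = int(np.clip(joints[t, j, 1], 0, W - 1))
            r0, r1 = max(r - radius, 0), min(r + radius + 1, H)
            c0, c1 = max(c - radius, 0), min(c + radius + 1, W)
            acc += fmap[:, t, r0:r1, c0:c1].mean(axis=(1, 2))
        feats[j] = acc / T
    return feats


def aggregate_decisions(local_feats: np.ndarray, weight: np.ndarray, bias: np.ndarray):
    """Classify each joint region, then fuse the local decisions.

    A shared linear classifier (assumption) maps (J, C) features to (J, K)
    logits; the global decision is the mean of the local class probabilities,
    i.e. simple soft voting as in ensemble learning.
    """
    local_probs = softmax(local_feats @ weight + bias, axis=1)  # (J, K)
    global_probs = local_probs.mean(axis=0)                     # (K,)
    return local_probs, global_probs


def training_loss(local_probs, global_probs, label, use_local=True, use_global=True, eps=1e-12):
    """Cross-entropy on local and/or global decisions, selected by flags."""
    loss = 0.0
    if use_local:
        loss += float(-np.log(local_probs[:, label] + eps).mean())
    if use_global:
        loss += float(-np.log(global_probs[label] + eps))
    return loss


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    C, T, H, W, J, K = 8, 4, 7, 7, 5, 3
    fmap = rng.standard_normal((C, T, H, W))
    joints = rng.integers(0, H, size=(T, J, 2))
    weight = rng.standard_normal((C, K)) * 0.1
    bias = np.zeros(K)
    local_probs, global_probs = aggregate_decisions(joint_region_features(fmap, joints), weight, bias)
    print("predicted class:", global_probs.argmax())
    print("loss:", training_loss(local_probs, global_probs, label=1))
```

The `use_local` and `use_global` flags mirror the abstract's point that supervision can be added selectively to the local and global decisions. Averaging probabilities is only one common ensemble rule; weighted voting or logit averaging would fit the same structure.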

Original language: English
Article number: 9169922
Pages (from-to): 2247-2259
Number of pages: 13
Journal: IEEE Transactions on Circuits and Systems for Video Technology
Volume: 31
Issue number: 6
State: Published - Jun 2021

Keywords

  • action recognition
  • Attention
  • decision aggregation
  • local decision
