Abstract
Recognizing the actions performed in a video is challenging for an intelligent system because videos exhibit wide variations and contain an enormous amount of information. Attention mechanisms focus on key target regions, ignore irrelevant information, and extract more discriminative features. In recent years, attention mechanisms have been introduced into video recognition. Although this has spawned a rich literature, most attention research aims to aggregate local features. Instead of feature aggregation, and inspired by ensemble learning, we propose to aggregate decisions based on local spatio-temporal attention regions for action recognition. The proposed decision fusion module is easy to interpret and architecture-independent. In this article, the regions around the body joints are regarded as the key regions. We use the regions of the 3-D feature maps that correspond to the body joints as the basic local features for local classification. Finally, all the local classification results are combined to make a global decision. Furthermore, when training the network, we can selectively add supervision to the local and global decisions. We experimentally show that the proposed mechanism improves recognition performance on multiple datasets, which demonstrates its effectiveness.
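The decision-aggregation idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The window pooling around each joint, the shared linear classifier, the averaging fusion rule, the weighted loss, and every name (`pool_joint_region`, `aggregate_local_decisions`, `combined_loss`, `radius`) are assumptions made only for illustration; the abstract does not specify them.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def pool_joint_region(features, t, y, x, radius=1):
    # Average-pool a window around (y, x) in frame t of a (C, T, H, W) feature map.
    # The window is clipped at the feature-map borders.
    _, _, H, W = features.shape
    y0, y1 = max(y - radius, 0), min(y + radius + 1, H)
    x0, x1 = max(x - radius, 0), min(x + radius + 1, W)
    return features[:, t, y0:y1, x0:x1].mean(axis=(1, 2))

def aggregate_local_decisions(features, joints, weight, bias, radius=1):
    # joints: integer array (T, J, 2) holding (y, x) per joint in feature-map coordinates.
    # weight: (num_classes, C), bias: (num_classes,). A shared classifier is an assumption.
    T, J, _ = joints.shape
    local_probs = np.empty((T, J, weight.shape[0]))
    for t in range(T):
        for j in range(J):
            y, x = joints[t, j]
            local_feat = pool_joint_region(features, t, int(y), int(x), radius)
            local_probs[t, j] = softmax(weight @ local_feat + bias)
    # Assumed fusion rule: average all local class distributions.
    global_probs = local_probs.mean(axis=(0, 1))
    return local_probs, global_probs

def combined_loss(local_probs, global_probs, label, local_weight=1.0, global_weight=1.0):
    # Cross-entropy on local and global decisions; a zero weight switches one term off.
    eps = 1e-12
    local_ce = -np.log(local_probs[..., label] + eps).mean()
    global_ce = -np.log(global_probs[label] + eps)
    return local_weight * local_ce + global_weight * global_ce

# Toy usage with random data (shapes are arbitrary).
rng = np.random.default_rng(0)
C, T, H, W, J, K = 8, 4, 7, 7, 5, 3
features = rng.standard_normal((C, T, H, W))
joints = rng.integers(0, H, size=(T, J, 2))
weight = rng.standard_normal((K, C))
bias = np.zeros(K)
local_probs, global_probs = aggregate_local_decisions(features, joints, weight, bias)
loss = combined_loss(local_probs, global_probs, label=1)
print(global_probs.shape, int(global_probs.argmax()), float(loss))
```

Averaging keeps the global output a valid probability distribution and makes each joint region's contribution easy to inspect, which matches the interpretability claim in spirit; a learned or weighted fusion would be an equally plausible alternative.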
| Original language | English |
|---|---|
| Article number | 9169922 |
| Pages (from-to) | 2247-2259 |
| Number of pages | 13 |
| Journal | IEEE Transactions on Circuits and Systems for Video Technology |
| Volume | 31 |
| Issue number | 6 |
| State | Published - Jun 2021 |
Keywords
- action recognition
- attention
- decision aggregation
- local decision
Fingerprint
Dive into the research topics of 'Club Ideas and Exertions: Aggregating Local Predictions for Action Recognition'. Together they form a unique fingerprint.