Abstract
Recently, 3D Convolutional Neural Networks (3D ConvNets) have been widely exploited for action recognition and achieved satisfying performance. However, the superior action features are often drowned in numerous irrelevant information, which immensely enhances the difficulty of video representation. To find a generic cost-efficient approach to balance the parameters and performance, we present a novel network to mine the Salient Spatio-Temporal Feature based on 3D ConvNets backbone for action recognition, termed as S2TF-Net. Firstly, we extract the salient features of each 3D residual block by constructing a multi-scale module for Salient Semantic Feature mining (SSF-Module). Then, with the aim of preserving the salient features in pooling operations, we establish a Two-branch Salient Feature Preserving Module (TSFP-Module). Besides, these above two modules with proper loss function can collaborate in an “easy-to-concat” fashion for most 3D ResNet backbones to classify more accurately albeit in the shallower network. Finally, we conduct experiments over three popular action recognition datasets, where our S2TF-Net is competitive compared with the deeper 3D backbones or current state-of-the-art results. Treating the P3D, 3D ResNet, Non-local I3D and X3D as baseline, the proposed method improves them to varying degrees. Particularly, for Non-local I3D ResNet, the proposed S2TF-Net enhances 4.1%, 3.0% and 4.6% in Kinetics-400, UCF101 and HMDB51 datasets, achieving the accuracy of 74.8%, 95.1% and 80.9%. We hope this study will provide useful inspiration and experience for future research about more cost-effective methods. Code is released here: https://github.com/xiaoxiAries/S2TFNet.
| Original language | English |
|---|---|
| Article number | 117381 |
| Journal | Signal Processing: Image Communication |
| Volume | 138 |
| DOIs | |
| State | Published - Oct 2025 |
Keywords
- 3D residual block
- Action recognition
- Pooling
- Salient features
- Video classification
Fingerprint
Dive into the research topics of 'Mining the Salient Spatio-Temporal Feature with S2TF-Net for action recognition'. Together they form a unique fingerprint.Cite this
- APA
- Author
- BIBTEX
- Harvard
- Standard
- RIS
- Vancouver