
Mining the Salient Spatio-Temporal Feature with S2TF-Net for action recognition

  • Xiaoxi Liu
  • Ju Liu
  • Lingchen Gu
  • Yafeng Li
  • Xiaojun Chang
  • Feiping Nie
  • Shandong University
  • University of Technology Sydney

Research output: Contribution to journal > Article > peer-review

1 Scopus citation

Abstract

3D Convolutional Neural Networks (3D ConvNets) have recently been widely used for action recognition and have achieved satisfactory performance. However, salient action features are often drowned out by large amounts of irrelevant information, which makes video representation considerably harder. To find a generic, cost-efficient approach that balances parameter count and performance, we present S2TF-Net, a novel network built on a 3D ConvNet backbone that mines the Salient Spatio-Temporal Feature for action recognition. First, we extract the salient features of each 3D residual block by constructing a multi-scale module for Salient Semantic Feature mining (SSF-Module). Then, to preserve the salient features through pooling operations, we establish a Two-branch Salient Feature Preserving Module (TSFP-Module). With a proper loss function, these two modules can be attached in an “easy-to-concat” fashion to most 3D ResNet backbones, allowing more accurate classification even with a shallower network. Finally, we conduct experiments on three popular action recognition datasets, where S2TF-Net is competitive with deeper 3D backbones and current state-of-the-art results. With P3D, 3D ResNet, Non-local I3D and X3D as baselines, the proposed method improves each of them to varying degrees. In particular, for Non-local I3D ResNet, S2TF-Net improves accuracy by 4.1%, 3.0% and 4.6% on the Kinetics-400, UCF101 and HMDB51 datasets, reaching 74.8%, 95.1% and 80.9%, respectively. We hope this study will provide useful inspiration and experience for future research on more cost-effective methods. Code is released here: https://github.com/xiaoxiAries/S2TFNet.
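
The Non-local I3D ResNet figures above can be read back into approximate baseline accuracies. The short sketch below does that arithmetic. It assumes the reported gains are absolute percentage points rather than relative improvements, which the abstract does not state explicitly; the names `REPORTED` and `implied_baseline` are illustrative only and do not come from the released code.

```python
# Results for S2TF-Net on a Non-local I3D ResNet backbone, as quoted in the abstract:
# dataset -> (final accuracy in %, reported gain)
REPORTED = {
    "Kinetics-400": (74.8, 4.1),
    "UCF101": (95.1, 3.0),
    "HMDB51": (80.9, 4.6),
}


def implied_baseline(final_acc, gain):
    """Return the baseline accuracy implied by a final accuracy and a gain.

    Assumption: the gain is in absolute percentage points, so the
    baseline is simply final accuracy minus gain (rounded to one decimal).
    """
    return round(final_acc - gain, 1)


# Hypothetical helper output: implied baseline per dataset.
baselines = {
    name: implied_baseline(acc, gain) for name, (acc, gain) in REPORTED.items()
}

for name, base in baselines.items():
    print(f"{name}: implied baseline {base}%")
```

Under that assumption, the implied baselines are 70.7%, 92.1% and 76.3% for Kinetics-400, UCF101 and HMDB51. These are derived values, not numbers reported in the abstract.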

Original language: English
Article number: 117381
Journal: Signal Processing: Image Communication
Volume: 138
State: Published - Oct 2025

Keywords

  • 3D residual block
  • Action recognition
  • Pooling
  • Salient features
  • Video classification
