TY - JOUR
T1 - Self-Supervised Global Spatio-Temporal Interaction Pre-Training for Group Activity Recognition
AU - Du, Zexing
AU - Wang, Xue
AU - Wang, Qing
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023/9/1
Y1 - 2023/9/1
N2 - This paper explores distinctive spatio-temporal representations in a self-supervised manner for group activity recognition. First, previous networks treat spatial- and temporal-aware information as a whole, limiting their ability to represent the complex spatio-temporal correlations in group activities. We therefore propose Spatial and Temporal Attention Heads (STAHs) to extract spatial- and temporal-aware representations independently, generating complementary contexts that improve group activity understanding. We then propose the Global Spatio-Temporal Contrastive (GSTCo) loss to aggregate these two kinds of features. Unlike previous works that focus on individual temporal consistency while overlooking correlations between actors, i.e., a local perspective, we model global spatial and temporal dependencies. Moreover, GSTCo effectively avoids the trivial solutions common in contrastive learning by balancing spatial and temporal representations. Furthermore, our method introduces only modest overhead during pre-training and adds no parameters or computational cost at inference, guaranteeing efficiency. Evaluations on widely used group activity recognition datasets show strong performance, and applying our pre-trained backbone to existing networks achieves state-of-the-art results. Extensive experiments verify the generalizability of our method.
AB - This paper explores distinctive spatio-temporal representations in a self-supervised manner for group activity recognition. First, previous networks treat spatial- and temporal-aware information as a whole, limiting their ability to represent the complex spatio-temporal correlations in group activities. We therefore propose Spatial and Temporal Attention Heads (STAHs) to extract spatial- and temporal-aware representations independently, generating complementary contexts that improve group activity understanding. We then propose the Global Spatio-Temporal Contrastive (GSTCo) loss to aggregate these two kinds of features. Unlike previous works that focus on individual temporal consistency while overlooking correlations between actors, i.e., a local perspective, we model global spatial and temporal dependencies. Moreover, GSTCo effectively avoids the trivial solutions common in contrastive learning by balancing spatial and temporal representations. Furthermore, our method introduces only modest overhead during pre-training and adds no parameters or computational cost at inference, guaranteeing efficiency. Evaluations on widely used group activity recognition datasets show strong performance, and applying our pre-trained backbone to existing networks achieves state-of-the-art results. Extensive experiments verify the generalizability of our method.
KW - contrastive learning
KW - Group activity recognition
KW - self-supervised learning
KW - spatio-temporal representation
UR - http://www.scopus.com/inward/record.url?scp=85149401149&partnerID=8YFLogxK
U2 - 10.1109/TCSVT.2023.3249906
DO - 10.1109/TCSVT.2023.3249906
M3 - Article
AN - SCOPUS:85149401149
SN - 1051-8215
VL - 33
SP - 5076
EP - 5088
JO - IEEE Transactions on Circuits and Systems for Video Technology
JF - IEEE Transactions on Circuits and Systems for Video Technology
IS - 9
ER -