TY - JOUR
T1 - Self-Supervised Global Spatio-Temporal Interaction Pre-Training for Group Activity Recognition
AU - Du, Zexing
AU - Wang, Xue
AU - Wang, Qing
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023/9/1
Y1 - 2023/9/1
N2 - This paper explores distinctive spatio-temporal representations in a self-supervised manner for group activity recognition. First, previous networks treat spatial- and temporal-aware information as a whole, limiting their ability to represent the complex spatio-temporal correlations in group activities. We therefore propose Spatial and Temporal Attention Heads (STAHs) to extract spatial- and temporal-aware representations independently, generating complementary contexts that improve group activity understanding. We then propose the Global Spatio-Temporal Contrastive (GSTCo) loss to aggregate these two kinds of features. Unlike previous works that focus on individual temporal consistency while overlooking correlations between actors, i.e., a local perspective, we model global spatial and temporal dependencies. Moreover, GSTCo effectively avoids the trivial solutions common in contrastive learning by balancing spatial and temporal representations. Furthermore, our method introduces only modest overhead during pre-training and adds no parameters or computational cost at inference, guaranteeing efficiency. Evaluations on widely used group activity recognition datasets show strong performance, and applying our pre-trained backbone to existing networks achieves state-of-the-art results. Extensive experiments verify the generalizability of our method.
AB - This paper explores distinctive spatio-temporal representations in a self-supervised manner for group activity recognition. First, previous networks treat spatial- and temporal-aware information as a whole, limiting their ability to represent the complex spatio-temporal correlations in group activities. We therefore propose Spatial and Temporal Attention Heads (STAHs) to extract spatial- and temporal-aware representations independently, generating complementary contexts that improve group activity understanding. We then propose the Global Spatio-Temporal Contrastive (GSTCo) loss to aggregate these two kinds of features. Unlike previous works that focus on individual temporal consistency while overlooking correlations between actors, i.e., a local perspective, we model global spatial and temporal dependencies. Moreover, GSTCo effectively avoids the trivial solutions common in contrastive learning by balancing spatial and temporal representations. Furthermore, our method introduces only modest overhead during pre-training and adds no parameters or computational cost at inference, guaranteeing efficiency. Evaluations on widely used group activity recognition datasets show strong performance, and applying our pre-trained backbone to existing networks achieves state-of-the-art results. Extensive experiments verify the generalizability of our method.
KW - contrastive learning
KW - Group activity recognition
KW - self-supervised learning
KW - spatio-temporal representation
UR - http://www.scopus.com/inward/record.url?scp=85149401149&partnerID=8YFLogxK
U2 - 10.1109/TCSVT.2023.3249906
DO - 10.1109/TCSVT.2023.3249906
M3 - Article
AN - SCOPUS:85149401149
SN - 1051-8215
VL - 33
SP - 5076
EP - 5088
JO - IEEE Transactions on Circuits and Systems for Video Technology
JF - IEEE Transactions on Circuits and Systems for Video Technology
IS - 9
ER -