Exploring global context and position-aware representation for group activity recognition

Zexing Du; Qing Wang

doi:10.1016/j.imavis.2024.105181

School of Computer Science

Northwestern Polytechnical University Xian

Research output: Contribution to journal › Article › peer-review

Abstract

This paper explores the context and position information in the scene for group activity understanding. Firstly, previous group activity recognition methods strive to reason on individual features without considering the information in the scene. Besides correlations among actors, we argue that integrating the scene context simultaneously can afford us more useful and supplementary cues. Therefore, we propose a new network, termed Contextual Transformer Network (CTN), to incorporate global contextual information into individual representations. In addition, the position of individuals also plays a vital role in group activity understanding. Unlike previous methods that explore correlations among individuals semantically, we propose Clustered Position Embedding (CPE) to integrate the spatial structure of actors and produce position-aware representations. Experimental results on two widely used datasets for sports video and social activity (i.e., Volleyball and Collective Activity datasets) show that the proposed method outperforms state-of-the-art approaches. Especially, when using ResNet-18 as the backbone, our method achieves 93.6/93.9% MCA/MPCA on the Volleyball dataset and 95.4/96.3% MCA/MPCA on the Collective Activity dataset.

Original language	English
Article number	105181
Journal	Image and Vision Computing
Volume	149
DOI	https://doi.org/10.1016/j.imavis.2024.105181
Publication status	Published - Sep 2024

Cite this

@article{9d1811eeb7c94b539d1e32e9b8907c1b,

title = "Exploring global context and position-aware representation for group activity recognition",

abstract = "This paper explores the context and position information in the scene for group activity understanding. Firstly, previous group activity recognition methods strive to reason on individual features without considering the information in the scene. Besides correlations among actors, we argue that integrating the scene context simultaneously can afford us more useful and supplementary cues. Therefore, we propose a new network, termed Contextual Transformer Network (CTN), to incorporate global contextual information into individual representations. In addition, the position of individuals also plays a vital role in group activity understanding. Unlike previous methods that explore correlations among individuals semantically, we propose Clustered Position Embedding (CPE) to integrate the spatial structure of actors and produce position-aware representations. Experimental results on two widely used datasets for sports video and social activity (i.e., Volleyball and Collective Activity datasets) show that the proposed method outperforms state-of-the-art approaches. Especially, when using ResNet-18 as the backbone, our method achieves 93.6/93.9% MCA/MPCA on the Volleyball dataset and 95.4/96.3% MCA/MPCA on the Collective Activity dataset.",

keywords = "Group activity recognition, Position-aware representation, Spatio-temporal representation",

author = "Zexing Du and Qing Wang",

note = "Publisher Copyright: {\textcopyright} 2024 Elsevier B.V.",

year = "2024",

month = sep,

doi = "10.1016/j.imavis.2024.105181",

language = "English",

volume = "149",

journal = "Image and Vision Computing",

issn = "0262-8856",

publisher = "Elsevier Ltd",

}

TY - JOUR

T1 - Exploring global context and position-aware representation for group activity recognition

AU - Du, Zexing

AU - Wang, Qing

PY - 2024/9

Y1 - 2024/9

N2 - This paper explores the context and position information in the scene for group activity understanding. Firstly, previous group activity recognition methods strive to reason on individual features without considering the information in the scene. Besides correlations among actors, we argue that integrating the scene context simultaneously can afford us more useful and supplementary cues. Therefore, we propose a new network, termed Contextual Transformer Network (CTN), to incorporate global contextual information into individual representations. In addition, the position of individuals also plays a vital role in group activity understanding. Unlike previous methods that explore correlations among individuals semantically, we propose Clustered Position Embedding (CPE) to integrate the spatial structure of actors and produce position-aware representations. Experimental results on two widely used datasets for sports video and social activity (i.e., Volleyball and Collective Activity datasets) show that the proposed method outperforms state-of-the-art approaches. Especially, when using ResNet-18 as the backbone, our method achieves 93.6/93.9% MCA/MPCA on the Volleyball dataset and 95.4/96.3% MCA/MPCA on the Collective Activity dataset.

AB - This paper explores the context and position information in the scene for group activity understanding. Firstly, previous group activity recognition methods strive to reason on individual features without considering the information in the scene. Besides correlations among actors, we argue that integrating the scene context simultaneously can afford us more useful and supplementary cues. Therefore, we propose a new network, termed Contextual Transformer Network (CTN), to incorporate global contextual information into individual representations. In addition, the position of individuals also plays a vital role in group activity understanding. Unlike previous methods that explore correlations among individuals semantically, we propose Clustered Position Embedding (CPE) to integrate the spatial structure of actors and produce position-aware representations. Experimental results on two widely used datasets for sports video and social activity (i.e., Volleyball and Collective Activity datasets) show that the proposed method outperforms state-of-the-art approaches. Especially, when using ResNet-18 as the backbone, our method achieves 93.6/93.9% MCA/MPCA on the Volleyball dataset and 95.4/96.3% MCA/MPCA on the Collective Activity dataset.

KW - Group activity recognition

KW - Position-aware representation

KW - Spatio-temporal representation

UR - http://www.scopus.com/inward/record.url?scp=85198583033&partnerID=8YFLogxK

U2 - 10.1016/j.imavis.2024.105181

DO - 10.1016/j.imavis.2024.105181

M3 - Article

AN - SCOPUS:85198583033

SN - 0262-8856

VL - 149

JO - Image and Vision Computing

JF - Image and Vision Computing

M1 - 105181

ER -
