Exploring global context and position-aware representation for group activity recognition

Zexing Du, Qing Wang

Research output: Contribution to journal › Article › peer-review

Abstract

This paper explores contextual and positional information in the scene for group activity understanding. Previous group activity recognition methods reason over individual features without considering information from the scene as a whole. We argue that, beyond correlations among actors, integrating scene context provides useful and complementary cues. We therefore propose a new network, termed Contextual Transformer Network (CTN), to incorporate global contextual information into individual representations. In addition, the positions of individuals play a vital role in group activity understanding. Unlike previous methods that model correlations among individuals only semantically, we propose Clustered Position Embedding (CPE) to integrate the spatial structure of actors and produce position-aware representations. Experimental results on two widely used datasets for sports video and social activity (i.e., the Volleyball and Collective Activity datasets) show that the proposed method outperforms state-of-the-art approaches. In particular, with ResNet-18 as the backbone, our method achieves 93.6/93.9% MCA/MPCA on the Volleyball dataset and 95.4/96.3% MCA/MPCA on the Collective Activity dataset.
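The abstract does not specify how Clustered Position Embedding is implemented. Purely as a rough illustration of the general idea — grouping actors by spatial location and giving each group a shared position embedding — here is a minimal NumPy sketch. The grid-quantization scheme, function name, and random embedding table are all hypothetical assumptions for illustration, not the authors' method:

```python
import numpy as np

def clustered_position_embedding(centers, embed_dim=8, grid=2, seed=0):
    """Toy sketch of a clustered position embedding (hypothetical).

    Each actor's normalized (x, y) center is quantized into a grid x grid
    spatial layout; actors falling in the same cell share one embedding
    vector, looked up from a (randomly initialised) per-cluster table.
    """
    rng = np.random.default_rng(seed)
    table = rng.standard_normal((grid * grid, embed_dim))  # one row per cluster
    # Quantize coordinates in [0, 1] into grid bins, clamping the upper edge.
    xs = np.minimum((centers[:, 0] * grid).astype(int), grid - 1)
    ys = np.minimum((centers[:, 1] * grid).astype(int), grid - 1)
    cluster_ids = xs * grid + ys  # flat cluster index per actor
    return table[cluster_ids], cluster_ids
```

In a full model, the returned embeddings would typically be added to the per-actor features before attention, so that actors in the same spatial cluster share positional context.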

Original language: English
Article number: 105181
Journal: Image and Vision Computing
Volume: 149
State: Published - Sep 2024

Keywords

  • Group activity recognition
  • Position-aware representation
  • Spatio-temporal representation

