TY - JOUR
T1 - Video Crowd Localization With Multifocus Gaussian Neighborhood Attention and a Large-Scale Benchmark
AU - Li, Haopeng
AU - Liu, Lingbo
AU - Yang, Kunlin
AU - Liu, Shinan
AU - Gao, Junyu
AU - Zhao, Bin
AU - Zhang, Rui
AU - Hou, Jun
N1 - Publisher Copyright:
© 1992-2012 IEEE.
PY - 2022
Y1 - 2022
N2 - Video crowd localization is a crucial yet challenging task that aims to estimate the exact locations of human heads in crowded videos. To model the spatial-temporal dependencies of human mobility, we propose a multi-focus Gaussian neighborhood attention (GNA), which can effectively exploit long-range correspondences while maintaining the spatial topological structure of the input videos. In particular, our GNA can also capture the scale variation of human heads well using the equipped multi-focus mechanism. Based on the multi-focus GNA, we develop a unified neural network called GNANet to accurately locate head centers in video clips by fully aggregating spatial-temporal information via a scene modeling module and a context cross-attention module. Moreover, to facilitate future research in this field, we introduce a large-scale crowd video benchmark named VSCrowd (https://github.com/HopLee6/VSCrowd), which consists of 60K+ frames captured in various surveillance scenes and 2M+ head annotations. Finally, we conduct extensive experiments on three datasets including our VSCrowd, and the experimental results show that the proposed method achieves state-of-the-art performance for both video crowd localization and counting.
AB - Video crowd localization is a crucial yet challenging task that aims to estimate the exact locations of human heads in crowded videos. To model the spatial-temporal dependencies of human mobility, we propose a multi-focus Gaussian neighborhood attention (GNA), which can effectively exploit long-range correspondences while maintaining the spatial topological structure of the input videos. In particular, our GNA can also capture the scale variation of human heads well using the equipped multi-focus mechanism. Based on the multi-focus GNA, we develop a unified neural network called GNANet to accurately locate head centers in video clips by fully aggregating spatial-temporal information via a scene modeling module and a context cross-attention module. Moreover, to facilitate future research in this field, we introduce a large-scale crowd video benchmark named VSCrowd (https://github.com/HopLee6/VSCrowd), which consists of 60K+ frames captured in various surveillance scenes and 2M+ head annotations. Finally, we conduct extensive experiments on three datasets including our VSCrowd, and the experimental results show that the proposed method achieves state-of-the-art performance for both video crowd localization and counting.
KW - Gaussian neighborhood attention
KW - spatial-temporal modeling
KW - Video crowd analysis
KW - VSCrowd dataset
UR - http://www.scopus.com/inward/record.url?scp=85138460027&partnerID=8YFLogxK
U2 - 10.1109/TIP.2022.3205210
DO - 10.1109/TIP.2022.3205210
M3 - Article
C2 - 36103439
AN - SCOPUS:85138460027
SN - 1057-7149
VL - 31
SP - 6032
EP - 6047
JO - IEEE Transactions on Image Processing
JF - IEEE Transactions on Image Processing
ER -