Video Crowd Localization With Multifocus Gaussian Neighborhood Attention and a Large-Scale Benchmark

Haopeng Li; Lingbo Liu; Kunlin Yang; Shinan Liu; Junyu Gao; Bin Zhao; Rui Zhang; Jun Hou

doi:10.1109/TIP.2022.3205210

School of Artificial Intelligence, Optics and Electronics (光电与智能研究院)

Research output: Contribution to journal › Article › peer-review

15 Citations (Scopus)

Abstract

Video crowd localization is a crucial yet challenging task that aims to estimate the exact locations of human heads in crowded videos. To model the spatial-temporal dependencies of human mobility, we propose a multi-focus Gaussian neighborhood attention (GNA), which can effectively exploit long-range correspondences while maintaining the spatial topological structure of the input videos. In particular, our GNA can also capture the scale variation of human heads well using its multi-focus mechanism. Based on the multi-focus GNA, we develop a unified neural network called GNANet to accurately locate head centers in video clips by fully aggregating spatial-temporal information via a scene modeling module and a context cross-attention module. Moreover, to facilitate future research in this field, we introduce a large-scale crowd video benchmark named VSCrowd (https://github.com/HopLee6/VSCrowd), which consists of 60K+ frames captured in various surveillance scenes and 2M+ head annotations. Finally, we conduct extensive experiments on three datasets, including our VSCrowd, and the experimental results show that the proposed method achieves state-of-the-art performance for both video crowd localization and counting.

Original language	English
Pages (from-to)	6032-6047
Number of pages	16
Journal	IEEE Transactions on Image Processing
Volume	31
DOI	https://doi.org/10.1109/TIP.2022.3205210
Publication status	Published - 2022

Cite this

@article{34e3ce6ca4c84947a74e749ee5470e1c,

title = "Video Crowd Localization With Multifocus Gaussian Neighborhood Attention and a Large-Scale Benchmark",

abstract = "Video crowd localization is a crucial yet challenging task that aims to estimate the exact locations of human heads in crowded videos. To model the spatial-temporal dependencies of human mobility, we propose a multi-focus Gaussian neighborhood attention (GNA), which can effectively exploit long-range correspondences while maintaining the spatial topological structure of the input videos. In particular, our GNA can also capture the scale variation of human heads well using its multi-focus mechanism. Based on the multi-focus GNA, we develop a unified neural network called GNANet to accurately locate head centers in video clips by fully aggregating spatial-temporal information via a scene modeling module and a context cross-attention module. Moreover, to facilitate future research in this field, we introduce a large-scale crowd video benchmark named VSCrowd (https://github.com/HopLee6/VSCrowd), which consists of 60K+ frames captured in various surveillance scenes and 2M+ head annotations. Finally, we conduct extensive experiments on three datasets, including our VSCrowd, and the experimental results show that the proposed method achieves state-of-the-art performance for both video crowd localization and counting.",

keywords = "Gaussian neighborhood attention, spatial-temporal modeling, Video crowd analysis, VSCrowd dataset",

author = "Haopeng Li and Lingbo Liu and Kunlin Yang and Shinan Liu and Junyu Gao and Bin Zhao and Rui Zhang and Jun Hou",

note = "Publisher Copyright: {\textcopyright} 1992-2012 IEEE.",

year = "2022",

doi = "10.1109/TIP.2022.3205210",

language = "English",

volume = "31",

pages = "6032--6047",

journal = "IEEE Transactions on Image Processing",

issn = "1057-7149",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - JOUR

T1 - Video Crowd Localization With Multifocus Gaussian Neighborhood Attention and a Large-Scale Benchmark

AU - Li, Haopeng

AU - Liu, Lingbo

AU - Yang, Kunlin

AU - Liu, Shinan

AU - Gao, Junyu

AU - Zhao, Bin

AU - Zhang, Rui

AU - Hou, Jun

PY - 2022

Y1 - 2022

AB - Video crowd localization is a crucial yet challenging task that aims to estimate the exact locations of human heads in crowded videos. To model the spatial-temporal dependencies of human mobility, we propose a multi-focus Gaussian neighborhood attention (GNA), which can effectively exploit long-range correspondences while maintaining the spatial topological structure of the input videos. In particular, our GNA can also capture the scale variation of human heads well using its multi-focus mechanism. Based on the multi-focus GNA, we develop a unified neural network called GNANet to accurately locate head centers in video clips by fully aggregating spatial-temporal information via a scene modeling module and a context cross-attention module. Moreover, to facilitate future research in this field, we introduce a large-scale crowd video benchmark named VSCrowd (https://github.com/HopLee6/VSCrowd), which consists of 60K+ frames captured in various surveillance scenes and 2M+ head annotations. Finally, we conduct extensive experiments on three datasets, including our VSCrowd, and the experimental results show that the proposed method achieves state-of-the-art performance for both video crowd localization and counting.

KW - Gaussian neighborhood attention

KW - spatial-temporal modeling

KW - Video crowd analysis

KW - VSCrowd dataset

UR - http://www.scopus.com/inward/record.url?scp=85138460027&partnerID=8YFLogxK

U2 - 10.1109/TIP.2022.3205210

DO - 10.1109/TIP.2022.3205210

M3 - Article

C2 - 36103439

AN - SCOPUS:85138460027

SN - 1057-7149

VL - 31

SP - 6032

EP - 6047

JO - IEEE Transactions on Image Processing

JF - IEEE Transactions on Image Processing

ER -
