Congested crowd instance localization with dilated convolutional swin transformer

Junyu Gao; Maoguo Gong; Xuelong Li

doi:10.1016/j.neucom.2022.09.113

School of Artificial Intelligence, OPtics and Electronics

Research output: Contribution to journal › Article › peer-review

30 Scopus citations

Abstract

Crowd localization is a new computer vision task, evolved from crowd counting. Different from the latter, it provides more precise location information for each instance, not just counting numbers for the whole crowd scene, which brings greater challenges, especially in extremely congested crowd scenes. In this paper, we focus on how to achieve precise instance localization in high-density crowd scenes, and to alleviate the problem that the feature extraction ability of the traditional model is reduced due to the target occlusion, the image blur, etc. To this end, we propose a Dilated Convolutional Swin Transformer (DCST) for congested crowd scenes. Specifically, a window-based vision transformer is introduced into the crowd localization task, which effectively improves the capacity of representation learning. Then, the well-designed dilated convolutional module is inserted into some different stages of the transformer to enhance the large-range contextual information. Extensive experiments evidence the effectiveness of the proposed methods and achieve the state-of-the-art performance on five popular datasets. Especially, the proposed model achieves F1-measure of 77.5% and MAE of 84.2 in terms of localization and counting performance, respectively.

Original language	English
Pages (from-to)	94-103
Number of pages	10
Journal	Neurocomputing
Volume	513
DOIs	https://doi.org/10.1016/j.neucom.2022.09.113
State	Published - 7 Nov 2022

Keywords

Contextual information
Crowd localization
Dilated convolution
Vision transformer

Cite this

@article{e1e8cb2ec83a46329ea5c151d25235ae,

title = "Congested crowd instance localization with dilated convolutional swin transformer",

abstract = "Crowd localization is a new computer vision task, evolved from crowd counting. Different from the latter, it provides more precise location information for each instance, not just counting numbers for the whole crowd scene, which brings greater challenges, especially in extremely congested crowd scenes. In this paper, we focus on how to achieve precise instance localization in high-density crowd scenes, and to alleviate the problem that the feature extraction ability of the traditional model is reduced due to the target occlusion, the image blur, etc. To this end, we propose a Dilated Convolutional Swin Transformer (DCST) for congested crowd scenes. Specifically, a window-based vision transformer is introduced into the crowd localization task, which effectively improves the capacity of representation learning. Then, the well-designed dilated convolutional module is inserted into some different stages of the transformer to enhance the large-range contextual information. Extensive experiments evidence the effectiveness of the proposed methods and achieve the state-of-the-art performance on five popular datasets. Especially, the proposed model achieves F1-measure of 77.5% and MAE of 84.2 in terms of localization and counting performance, respectively.",

keywords = "Contextual information, Crowd localization, Dilated convolution, Vision transformer",

author = "Junyu Gao and Maoguo Gong and Xuelong Li",

note = "Publisher Copyright: {\textcopyright} 2022 Elsevier B.V.",

year = "2022",

month = nov,

day = "7",

doi = "10.1016/j.neucom.2022.09.113",

language = "English",

volume = "513",

pages = "94--103",

journal = "Neurocomputing",

issn = "0925-2312",

publisher = "Elsevier B.V.",

}

TY - JOUR

T1 - Congested crowd instance localization with dilated convolutional swin transformer

AU - Gao, Junyu

AU - Gong, Maoguo

AU - Li, Xuelong

PY - 2022/11/7

Y1 - 2022/11/7

N2 - Crowd localization is a new computer vision task, evolved from crowd counting. Different from the latter, it provides more precise location information for each instance, not just counting numbers for the whole crowd scene, which brings greater challenges, especially in extremely congested crowd scenes. In this paper, we focus on how to achieve precise instance localization in high-density crowd scenes, and to alleviate the problem that the feature extraction ability of the traditional model is reduced due to the target occlusion, the image blur, etc. To this end, we propose a Dilated Convolutional Swin Transformer (DCST) for congested crowd scenes. Specifically, a window-based vision transformer is introduced into the crowd localization task, which effectively improves the capacity of representation learning. Then, the well-designed dilated convolutional module is inserted into some different stages of the transformer to enhance the large-range contextual information. Extensive experiments evidence the effectiveness of the proposed methods and achieve the state-of-the-art performance on five popular datasets. Especially, the proposed model achieves F1-measure of 77.5% and MAE of 84.2 in terms of localization and counting performance, respectively.

AB - Crowd localization is a new computer vision task, evolved from crowd counting. Different from the latter, it provides more precise location information for each instance, not just counting numbers for the whole crowd scene, which brings greater challenges, especially in extremely congested crowd scenes. In this paper, we focus on how to achieve precise instance localization in high-density crowd scenes, and to alleviate the problem that the feature extraction ability of the traditional model is reduced due to the target occlusion, the image blur, etc. To this end, we propose a Dilated Convolutional Swin Transformer (DCST) for congested crowd scenes. Specifically, a window-based vision transformer is introduced into the crowd localization task, which effectively improves the capacity of representation learning. Then, the well-designed dilated convolutional module is inserted into some different stages of the transformer to enhance the large-range contextual information. Extensive experiments evidence the effectiveness of the proposed methods and achieve the state-of-the-art performance on five popular datasets. Especially, the proposed model achieves F1-measure of 77.5% and MAE of 84.2 in terms of localization and counting performance, respectively.

KW - Contextual information

KW - Crowd localization

KW - Dilated convolution

KW - Vision transformer

UR - http://www.scopus.com/inward/record.url?scp=85138790370&partnerID=8YFLogxK

U2 - 10.1016/j.neucom.2022.09.113

DO - 10.1016/j.neucom.2022.09.113

M3 - Article

AN - SCOPUS:85138790370

SN - 0925-2312

VL - 513

SP - 94

EP - 103

JO - Neurocomputing

JF - Neurocomputing

ER -
