Congested crowd instance localization with dilated convolutional swin transformer

Junyu Gao; Maoguo Gong; Xuelong Li

doi:10.1016/j.neucom.2022.09.113

Institute of Optoelectronics and Intelligence

Research output: Contribution to journal › Article › Peer-reviewed

30 citations (Scopus)

Abstract

Crowd localization is a new computer vision task that evolved from crowd counting. Unlike crowd counting, it provides precise location information for each instance rather than only a count for the whole scene, which makes it more challenging, especially in extremely congested scenes. In this paper, we focus on achieving precise instance localization in high-density crowd scenes and on alleviating the loss of feature extraction ability that traditional models suffer under target occlusion, image blur, and similar degradations. To this end, we propose a Dilated Convolutional Swin Transformer (DCST) for congested crowd scenes. Specifically, a window-based vision transformer is introduced into the crowd localization task, which effectively improves the capacity of representation learning. A dilated convolutional module is then inserted into different stages of the transformer to enhance large-range contextual information. Extensive experiments demonstrate the effectiveness of the proposed method, which achieves state-of-the-art performance on five popular datasets. In particular, the proposed model achieves an F1-measure of 77.5% for localization and an MAE of 84.2 for counting.
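As a rough illustration of why dilated convolution widens the context seen by each feature, the sketch below computes the span covered by a dilated kernel and the receptive field of a small stack of stride-1 layers. The abstract does not state the kernel sizes or dilation rates used in DCST, so the function names and the example values are assumptions chosen only for illustration.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Span, in input pixels, covered by one dilated convolution kernel."""
    if kernel_size < 1 or dilation < 1:
        raise ValueError("kernel_size and dilation must be positive integers")
    # A dilation of d leaves d - 1 gaps between adjacent kernel taps.
    return dilation * (kernel_size - 1) + 1


def stacked_receptive_field(layers: list[tuple[int, int]]) -> int:
    """Receptive field of stride-1 convolutions applied one after another.

    `layers` is a list of (kernel_size, dilation) pairs (hypothetical values).
    """
    receptive_field = 1
    for kernel_size, dilation in layers:
        # Each stride-1 layer extends the field by its effective span minus one.
        receptive_field += effective_kernel_size(kernel_size, dilation) - 1
    return receptive_field


# Illustrative comparison: two plain 3x3 layers versus two dilated 3x3 layers.
plain = stacked_receptive_field([(3, 1), (3, 1)])
dilated = stacked_receptive_field([(3, 2), (3, 3)])
print(plain, dilated)
```

With the same number of weights per layer, the dilated stack covers a noticeably wider input region than the plain one, which is the general property the abstract relies on when it refers to large-range contextual information.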

Original language	English
Pages (from-to)	94-103
Number of pages	10
Journal	Neurocomputing
Volume	513
DOI	https://doi.org/10.1016/j.neucom.2022.09.113
Publication status	Published - 7 Nov 2022

Cite this

@article{e1e8cb2ec83a46329ea5c151d25235ae,

title = "Congested crowd instance localization with dilated convolutional swin transformer",

abstract = "Crowd localization is a new computer vision task, evolved from crowd counting. Different from the latter, it provides more precise location information for each instance, not just counting numbers for the whole crowd scene, which brings greater challenges, especially in extremely congested crowd scenes. In this paper, we focus on how to achieve precise instance localization in high-density crowd scenes, and to alleviate the problem that the feature extraction ability of the traditional model is reduced due to the target occlusion, the image blur, etc. To this end, we propose a Dilated Convolutional Swin Transformer (DCST) for congested crowd scenes. Specifically, a window-based vision transformer is introduced into the crowd localization task, which effectively improves the capacity of representation learning. Then, the well-designed dilated convolutional module is inserted into some different stages of the transformer to enhance the large-range contextual information. Extensive experiments evidence the effectiveness of the proposed methods and achieve the state-of-the-art performance on five popular datasets. Especially, the proposed model achieves F1-measure of 77.5% and MAE of 84.2 in terms of localization and counting performance, respectively.",

keywords = "Contextual information, Crowd localization, Dilated convolution, Vision transformer",

author = "Junyu Gao and Maoguo Gong and Xuelong Li",

note = "Publisher Copyright: {\textcopyright} 2022 Elsevier B.V.",

year = "2022",

month = nov,

day = "7",

doi = "10.1016/j.neucom.2022.09.113",

language = "English",

volume = "513",

pages = "94--103",

journal = "Neurocomputing",

issn = "0925-2312",

publisher = "Elsevier B.V.",

}

TY - JOUR

T1 - Congested crowd instance localization with dilated convolutional swin transformer

AU - Gao, Junyu

AU - Gong, Maoguo

AU - Li, Xuelong

PY - 2022/11/7

Y1 - 2022/11/7

N2 - Crowd localization is a new computer vision task, evolved from crowd counting. Different from the latter, it provides more precise location information for each instance, not just counting numbers for the whole crowd scene, which brings greater challenges, especially in extremely congested crowd scenes. In this paper, we focus on how to achieve precise instance localization in high-density crowd scenes, and to alleviate the problem that the feature extraction ability of the traditional model is reduced due to the target occlusion, the image blur, etc. To this end, we propose a Dilated Convolutional Swin Transformer (DCST) for congested crowd scenes. Specifically, a window-based vision transformer is introduced into the crowd localization task, which effectively improves the capacity of representation learning. Then, the well-designed dilated convolutional module is inserted into some different stages of the transformer to enhance the large-range contextual information. Extensive experiments evidence the effectiveness of the proposed methods and achieve the state-of-the-art performance on five popular datasets. Especially, the proposed model achieves F1-measure of 77.5% and MAE of 84.2 in terms of localization and counting performance, respectively.

AB - Crowd localization is a new computer vision task, evolved from crowd counting. Different from the latter, it provides more precise location information for each instance, not just counting numbers for the whole crowd scene, which brings greater challenges, especially in extremely congested crowd scenes. In this paper, we focus on how to achieve precise instance localization in high-density crowd scenes, and to alleviate the problem that the feature extraction ability of the traditional model is reduced due to the target occlusion, the image blur, etc. To this end, we propose a Dilated Convolutional Swin Transformer (DCST) for congested crowd scenes. Specifically, a window-based vision transformer is introduced into the crowd localization task, which effectively improves the capacity of representation learning. Then, the well-designed dilated convolutional module is inserted into some different stages of the transformer to enhance the large-range contextual information. Extensive experiments evidence the effectiveness of the proposed methods and achieve the state-of-the-art performance on five popular datasets. Especially, the proposed model achieves F1-measure of 77.5% and MAE of 84.2 in terms of localization and counting performance, respectively.

KW - Contextual information

KW - Crowd localization

KW - Dilated convolution

KW - Vision transformer

UR - http://www.scopus.com/inward/record.url?scp=85138790370&partnerID=8YFLogxK

U2 - 10.1016/j.neucom.2022.09.113

DO - 10.1016/j.neucom.2022.09.113

M3 - Article

AN - SCOPUS:85138790370

SN - 0925-2312

VL - 513

SP - 94

EP - 103

JO - Neurocomputing

JF - Neurocomputing

ER -
