UAVSeg: Dual-Encoder Cross-Scale Attention Network for UAV Images Semantic Segmentation

Zhen Wang; Zhuhong You; Nan Xu; Chuanlei Zhang; De Shuang Huang

doi:10.1109/TGRS.2024.3502401

School of Computer Science

Research output: Contribution to journal › Article › Peer-reviewed

Abstract

Owing to the strong feature extraction and feature correlation modeling capabilities of convolutional neural networks (CNNs) and Transformer models, both have been widely used for semantic segmentation of unmanned aerial vehicle (UAV) aerial images. However, ground objects in aerial images carry feature information at different scales. Existing methods directly concatenate low-level visual features with high-level semantic features without further processing, which results in low segmentation precision. To address these challenges, we propose a Dual-Encoder Cross-Scale Attention Network. It efficiently extracts local and global context information from aerial images and performs fine-grained fusion of multi-scale features to improve semantic segmentation performance. First, we introduce the Dual-CNNs-Transformer Encoder, which embeds the Scan-Focus Window Transformer (SFWT) into CNNs as an auxiliary encoder to recover the local feature information lost during global context extraction. Second, we design the Cross-Scale Lightweight Integration (CSLI) module, which uses a Light Dot-Product Attention Mechanism (DPAM) to fuse multi-scale features while reducing the model's parameter count. Finally, a Linear Multi-Layer Perceptron (LMLP) restores the feature map resolution while expanding the deconvolution receptive field. To validate the effectiveness of the proposed method, we conducted extensive experiments on real aerial scene datasets, including UAVid, Urban Drone, and Aeroscapes. The experimental results show that our method achieves state-of-the-art performance while maintaining superior real-time efficiency. The implementation code will be available at https://github.com/darkseid-arch/UAVSeg.
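The abstract does not describe the internals of DPAM or CSLI. To make "dot-product attention for fusing multi-scale features" more concrete, here is a minimal sketch of generic scaled dot-product cross-attention between two feature scales. It makes several assumptions: tokens from a high-resolution, low-level map act as queries; tokens from a low-resolution, high-level map act as keys and values; projections are identity; and a residual connection is added. The function name `cross_scale_attention` and this query/key assignment are hypothetical illustrations, not the authors' design. Plain Python lists stand in for tensors so the example has no dependencies.

```python
import math
from typing import List

# Hypothetical illustration: generic scaled dot-product cross-attention,
# NOT the paper's DPAM/CSLI, whose details are not given in the abstract.
Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of an (n x k) matrix and a (k x m) matrix."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a: Matrix) -> Matrix:
    """Swap rows and columns."""
    return [list(row) for row in zip(*a)]


def softmax_rows(a: Matrix) -> Matrix:
    """Numerically stable softmax applied independently to each row."""
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        s = sum(exps)
        out.append([e / s for e in exps])
    return out


def cross_scale_attention(fine: Matrix, coarse: Matrix) -> Matrix:
    """Fuse coarse (high-level) tokens into fine (low-level) tokens.

    fine:   N_f x d tokens from a high-resolution feature map (queries)
    coarse: N_c x d tokens from a low-resolution feature map (keys and values)
    Returns N_f x d tokens: fine + softmax(Q K^T / sqrt(d)) V,
    with identity projections for Q, K and V (an assumption for brevity).
    """
    d = len(fine[0])
    scores = matmul(fine, transpose(coarse))      # N_f x N_c similarities
    scale = 1.0 / math.sqrt(d)
    weights = softmax_rows([[s * scale for s in row] for row in scores])
    context = matmul(weights, coarse)             # N_f x d attended context
    # Residual connection keeps the fine-scale detail alongside the context.
    return [[f + c for f, c in zip(fr, cr)] for fr, cr in zip(fine, context)]


if __name__ == "__main__":
    # A 2x2 fine map flattened to 4 tokens (d = 2) and a single coarse token.
    fine = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
    coarse = [[0.5, -0.5]]
    print(cross_scale_attention(fine, coarse))
    # → [[1.5, -0.5], [0.5, 0.5], [1.5, 0.5], [0.5, -0.5]]
```

Because the coarse map has far fewer tokens than the fine map, the score matrix here has N_f x N_c entries rather than N_f x N_f. This is one common way to keep cross-scale attention cheap. The abstract does not say whether DPAM uses this particular arrangement.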

Original language	English
Article number	3502401
Journal	IEEE Transactions on Geoscience and Remote Sensing
DOI	https://doi.org/10.1109/TGRS.2024.3502401
Publication status	Accepted/In press - 2024

Cite this

@article{ecf0eeb2944f4e6892648246b5611aed,

title = "UAVSeg: Dual-Encoder Cross-Scale Attention Network for UAV Images Semantic Segmentation",

keywords = "attention mechanism, feature extraction, multi-layer perceptron, semantic segmentation, UAV aerial images",

author = "Zhen Wang and Zhuhong You and Nan Xu and Chuanlei Zhang and Huang, {De Shuang}",

note = "Publisher Copyright: {\textcopyright} 1980-2012 IEEE.",

year = "2024",

doi = "10.1109/TGRS.2024.3502401",

language = "English",

journal = "IEEE Transactions on Geoscience and Remote Sensing",

issn = "0196-2892",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - JOUR

T1 - UAVSeg: Dual-Encoder Cross-Scale Attention Network for UAV Images Semantic Segmentation

AU - Wang, Zhen

AU - You, Zhuhong

AU - Xu, Nan

AU - Zhang, Chuanlei

AU - Huang, De Shuang

PY - 2024

Y1 - 2024

KW - attention mechanism

KW - feature extraction

KW - multi-layer perceptron

KW - semantic segmentation

KW - UAV aerial images

UR - http://www.scopus.com/inward/record.url?scp=85210081701&partnerID=8YFLogxK

U2 - 10.1109/TGRS.2024.3502401

DO - 10.1109/TGRS.2024.3502401

M3 - Article

AN - SCOPUS:85210081701

SN - 0196-2892

JO - IEEE Transactions on Geoscience and Remote Sensing

JF - IEEE Transactions on Geoscience and Remote Sensing

M1 - 3502401

ER -