Channel selection and local attention transformer model for semantic segmentation on UAV remote sensing scene

Da Liu; Hao Long; Zhenbao Liu

doi:10.1049/ipr2.13298

School of Civil Aviation

Beijing Union University

Research output: Contribution to journal › Article › peer-review

Abstract

Compared with semantic segmentation of common urban landscapes, semantic segmentation of unmanned aerial vehicle (UAV) images is more challenging because, under the influence of flight altitude, small targets occupy a very low percentage of pixels and exhibit multi-scale features. The successive grid downsampling strategy commonly used in current transformer-based methods omits some important features of small targets, and complex background interference can make the results even worse. Existing strategies respond by maintaining a higher resolution, but this incurs considerable computational cost, which poses a challenge for practical UAV applications. It is therefore important to design a framework that balances retaining more of the pixels that represent small objects during downsampling against reducing computational cost. To this end, the Channel Selection and Local Attention Transformer Model (CSLFormer) is proposed. During the overlap patch embedding of feature maps, the model allocates half of the important channels to global attention and local attention. These two types of attention focus on different aspects: one learns the relationships and importance among patches, while the other emphasizes the features of individual patches. The method shows superior performance on two public datasets, AeroScapes and Vaihingen, achieving mean intersection over union (mIoU) of 75.57% and 78.93%, respectively. CSLFormer has been released on GitHub: https://github.com/leoda1/CSLFormer.
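
The abstract reports its results as mean intersection over union (mIoU). For readers unfamiliar with the metric, the following is a minimal sketch of how mIoU is commonly computed from flattened label maps. It is not the authors' evaluation code: the function names `confusion_matrix` and `mean_iou` are hypothetical, and skipping classes that are absent from both the ground truth and the prediction is an assumption (benchmarks differ on this detail).

```python
from typing import List, Sequence


def confusion_matrix(
    y_true: Sequence[int], y_pred: Sequence[int], num_classes: int
) -> List[List[int]]:
    """Count pixels; rows index the ground-truth class, columns the prediction."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def mean_iou(
    y_true: Sequence[int], y_pred: Sequence[int], num_classes: int
) -> float:
    """Average the per-class IoU = TP / (TP + FP + FN) over the classes present."""
    matrix = confusion_matrix(y_true, y_pred, num_classes)
    ious = []
    for c in range(num_classes):
        tp = matrix[c][c]
        fp = sum(matrix[r][c] for r in range(num_classes)) - tp  # predicted c, labeled otherwise
        fn = sum(matrix[c]) - tp  # labeled c, predicted otherwise
        union = tp + fp + fn
        if union == 0:
            # Assumption: a class absent from both maps is left out of the mean.
            continue
        ious.append(tp / union)
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    # Toy example: six pixels, three classes.
    labels = [0, 0, 1, 1, 2, 2]
    preds = [0, 1, 1, 1, 2, 0]
    print(mean_iou(labels, preds, num_classes=3))
```

In this toy example the per-class IoUs are 1/3, 2/3 and 1/2, so the mean is 0.5. A small class contributes to the mean as much as a large one, which is why low-pixel-count targets weigh heavily on mIoU.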

Original language	English
Article number	e13298
Journal	IET Image Processing
Volume	19
Issue number	1
DOIs	https://doi.org/10.1049/ipr2.13298
State	Published - Jan 2025

Keywords

aircraft
computer vision
convolutional neural nets
feedforward neural nets
image segmentation

Access to Document

10.1049/ipr2.13298

Cite this

@article{47ef1c24022740d28579dee3b0d636de,

title = "Channel selection and local attention transformer model for semantic segmentation on UAV remote sensing scene",

abstract = "Compared with common urban landscape semantic segmentation, unmanned aerial vehicle (UAV) image semantic segmentation is more challenging because small targets have very low pixel percentages and multi-scale features due to the influence of flight altitude. Yet, the commonly used successive grid downsampling strategy in the current transformer-based methods omits some important features of small targets. Furthermore, due to the complex background interference, it can lead to even worse results. In reaction to this, existing strategies aim to maintain superior resolution. Nevertheless, the application of this method incurs considerable computational costs, which brings challenges for the practical applications of UAVs. So it is significant to design a novel framework to balance retaining more pixels representing small objects during downsampling and reducing computational costs. For this, the Channel Selection and the Local Attention Transformer Model (CSLFormer) are proposed. During the overlap patch embedding process of feature maps, the model allocates half of the important channels to global attention and local attention. These two types of attention focus on different aspects: one learns the relationships and importance among various patches, while the other emphasizes the features of individual patches. The method shows superior performance on two public datasets: AeroScapes and Vaihingen, achieving mean intersection over union (mIoU) of 75.57% and 78.93%, respectively. The proposed CSLFormer has been released on GitHub: https://github.com/leoda1/CSLFormer.",

keywords = "aircraft, computer vision, convolutional neural nets, feedforward neural nets, image segmentation",

author = "Da Liu and Hao Long and Zhenbao Liu",

note = "Publisher Copyright: {\textcopyright} 2024 The Author(s). IET Image Processing published by John Wiley \& Sons Ltd on behalf of The Institution of Engineering and Technology.",

year = "2025",

month = jan,

doi = "10.1049/ipr2.13298",

language = "English",

volume = "19",

journal = "IET Image Processing",

issn = "1751-9659",

publisher = "John Wiley \& Sons Inc.",

number = "1",

}

TY - JOUR

T1 - Channel selection and local attention transformer model for semantic segmentation on UAV remote sensing scene

AU - Liu, Da

AU - Long, Hao

AU - Liu, Zhenbao

PY - 2025/1

Y1 - 2025/1

N2 - Compared with common urban landscape semantic segmentation, unmanned aerial vehicle (UAV) image semantic segmentation is more challenging because small targets have very low pixel percentages and multi-scale features due to the influence of flight altitude. Yet, the commonly used successive grid downsampling strategy in the current transformer-based methods omits some important features of small targets. Furthermore, due to the complex background interference, it can lead to even worse results. In reaction to this, existing strategies aim to maintain superior resolution. Nevertheless, the application of this method incurs considerable computational costs, which brings challenges for the practical applications of UAVs. So it is significant to design a novel framework to balance retaining more pixels representing small objects during downsampling and reducing computational costs. For this, the Channel Selection and the Local Attention Transformer Model (CSLFormer) are proposed. During the overlap patch embedding process of feature maps, the model allocates half of the important channels to global attention and local attention. These two types of attention focus on different aspects: one learns the relationships and importance among various patches, while the other emphasizes the features of individual patches. The method shows superior performance on two public datasets: AeroScapes and Vaihingen, achieving mean intersection over union (mIoU) of 75.57% and 78.93%, respectively. The proposed CSLFormer has been released on GitHub: https://github.com/leoda1/CSLFormer.

AB - Compared with common urban landscape semantic segmentation, unmanned aerial vehicle (UAV) image semantic segmentation is more challenging because small targets have very low pixel percentages and multi-scale features due to the influence of flight altitude. Yet, the commonly used successive grid downsampling strategy in the current transformer-based methods omits some important features of small targets. Furthermore, due to the complex background interference, it can lead to even worse results. In reaction to this, existing strategies aim to maintain superior resolution. Nevertheless, the application of this method incurs considerable computational costs, which brings challenges for the practical applications of UAVs. So it is significant to design a novel framework to balance retaining more pixels representing small objects during downsampling and reducing computational costs. For this, the Channel Selection and the Local Attention Transformer Model (CSLFormer) are proposed. During the overlap patch embedding process of feature maps, the model allocates half of the important channels to global attention and local attention. These two types of attention focus on different aspects: one learns the relationships and importance among various patches, while the other emphasizes the features of individual patches. The method shows superior performance on two public datasets: AeroScapes and Vaihingen, achieving mean intersection over union (mIoU) of 75.57% and 78.93%, respectively. The proposed CSLFormer has been released on GitHub: https://github.com/leoda1/CSLFormer.

KW - aircraft

KW - computer vision

KW - convolutional neural nets

KW - feedforward neural nets

KW - image segmentation

UR - http://www.scopus.com/inward/record.url?scp=85211371806&partnerID=8YFLogxK

U2 - 10.1049/ipr2.13298

DO - 10.1049/ipr2.13298

M3 - Article

AN - SCOPUS:85211371806

SN - 1751-9659

VL - 19

JO - IET Image Processing

JF - IET Image Processing

IS - 1

M1 - e13298

ER -
