Hierarchical Feature Fusion of Transformer with Patch Dilating for Remote Sensing Scene Classification

Xiaoning Chen; Mingyang Ma; Yong Li; Shaohui Mei; Zonghao Han; Jian Zhao; Wei Cheng

doi:10.1109/TGRS.2023.3331880

School of Electronics and Information

Research output: Contribution to journal › Article › peer-review

16 Citations (Scopus)

Abstract

Recently, the Transformer-based technique has emerged as a promising solution for modeling contextual information in remote sensing (RS) scenes and has found widespread applications in RS scene classification. However, how to make full use of intermediate features learned in Transformers is of crucial importance in the RS scene classification tasks. Therefore, this article proposes a hierarchical feature fusion of transformer with patch dilating (HFFT-PD), which aims to capture rich contextual information from hierarchical features to enhance the performance of RS scene classification. Specifically, the HFFT-PD model consists of a hierarchical transformer merging (HTM) block and a lightweight adaptive channel compression (LACC) module, in which the HTM is specially designed for the Transformer architecture to bridge the semantic gaps between features from different hierarchical blocks, and the LACC accounts for the significance of distinct channels in the ultimate classification features. In addition, a brand-new Patch Dilating strategy is uniquely designed for the Transformer paradigm, functioning as a reassembly operator predicated on patch features. Contrasting with conventional upsampling techniques, Patch Dilating facilitates upsampling without requiring supplementary information, while concurrently preserving the semantic content of local spatial structure. Extensive and rigorous experiments conducted on the UC Merced land-use dataset (UCM), aerial image dataset (AID), and NWPU-45 datasets, with training ratios of 80%, 50%, and 20%, respectively, demonstrate that our proposed HFFT-PD outperforms the baseline at least by 0.59%, 0.44%, and 0.99%, respectively, showcasing the significant superiority of our HFFT-PD over contemporary state-of-the-art methodologies.

Original language	English
Article number	4410516
Pages (from-to)	1-16
Number of pages	16
Journal	IEEE Transactions on Geoscience and Remote Sensing
Volume	61
DOI	https://doi.org/10.1109/TGRS.2023.3331880
Publication status	Published - 2023

Cite this

@article{988d83d28349465998c62faddd7504e0,

title = "Hierarchical Feature Fusion of Transformer with Patch Dilating for Remote Sensing Scene Classification",

abstract = "Recently, the Transformer-based technique has emerged as a promising solution for modeling contextual information in remote sensing (RS) scenes and has found widespread applications in RS scene classification. However, how to make full use of intermediate features learned in Transformers is of crucial importance in the RS scene classification tasks. Therefore, this article proposes a hierarchical feature fusion of transformer with patch dilating (HFFT-PD), which aims to capture rich contextual information from hierarchical features to enhance the performance of RS scene classification. Specifically, the HFFT-PD model consists of a hierarchical transformer merging (HTM) block and a lightweight adaptive channel compression (LACC) module, in which the HTM is specially designed for the Transformer architecture to bridge the semantic gaps between features from different hierarchical blocks, and the LACC accounts for the significance of distinct channels in the ultimate classification features. In addition, a brand-new Patch Dilating strategy is uniquely designed for the Transformer paradigm, functioning as a reassembly operator predicated on patch features. Contrasting with conventional upsampling techniques, Patch Dilating facilitates upsampling without requiring supplementary information, while concurrently preserving the semantic content of local spatial structure. Extensive and rigorous experiments conducted on the UC Merced land-use dataset (UCM), aerial image dataset (AID), and NWPU-45 datasets, with training ratios of 80%, 50%, and 20%, respectively, demonstrate that our proposed HFFT-PD outperforms the baseline at least by 0.59%, 0.44%, and 0.99%, respectively, showcasing the significant superiority of our HFFT-PD over contemporary state-of-the-art methodologies.",

keywords = "Feature fusion, remote sensing (RS), scene classification, transformer",

author = "Xiaoning Chen and Mingyang Ma and Yong Li and Shaohui Mei and Zonghao Han and Jian Zhao and Wei Cheng",

note = "Publisher Copyright: {\textcopyright} 1980-2012 IEEE.",

year = "2023",

doi = "10.1109/TGRS.2023.3331880",

language = "English",

volume = "61",

pages = "1--16",

journal = "IEEE Transactions on Geoscience and Remote Sensing",

issn = "0196-2892",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - JOUR

T1 - Hierarchical Feature Fusion of Transformer with Patch Dilating for Remote Sensing Scene Classification

AU - Chen, Xiaoning

AU - Ma, Mingyang

AU - Li, Yong

AU - Mei, Shaohui

AU - Han, Zonghao

AU - Zhao, Jian

AU - Cheng, Wei

PY - 2023

Y1 - 2023

N2 - Recently, the Transformer-based technique has emerged as a promising solution for modeling contextual information in remote sensing (RS) scenes and has found widespread applications in RS scene classification. However, how to make full use of intermediate features learned in Transformers is of crucial importance in the RS scene classification tasks. Therefore, this article proposes a hierarchical feature fusion of transformer with patch dilating (HFFT-PD), which aims to capture rich contextual information from hierarchical features to enhance the performance of RS scene classification. Specifically, the HFFT-PD model consists of a hierarchical transformer merging (HTM) block and a lightweight adaptive channel compression (LACC) module, in which the HTM is specially designed for the Transformer architecture to bridge the semantic gaps between features from different hierarchical blocks, and the LACC accounts for the significance of distinct channels in the ultimate classification features. In addition, a brand-new Patch Dilating strategy is uniquely designed for the Transformer paradigm, functioning as a reassembly operator predicated on patch features. Contrasting with conventional upsampling techniques, Patch Dilating facilitates upsampling without requiring supplementary information, while concurrently preserving the semantic content of local spatial structure. Extensive and rigorous experiments conducted on the UC Merced land-use dataset (UCM), aerial image dataset (AID), and NWPU-45 datasets, with training ratios of 80%, 50%, and 20%, respectively, demonstrate that our proposed HFFT-PD outperforms the baseline at least by 0.59%, 0.44%, and 0.99%, respectively, showcasing the significant superiority of our HFFT-PD over contemporary state-of-the-art methodologies.

AB - Recently, the Transformer-based technique has emerged as a promising solution for modeling contextual information in remote sensing (RS) scenes and has found widespread applications in RS scene classification. However, how to make full use of intermediate features learned in Transformers is of crucial importance in the RS scene classification tasks. Therefore, this article proposes a hierarchical feature fusion of transformer with patch dilating (HFFT-PD), which aims to capture rich contextual information from hierarchical features to enhance the performance of RS scene classification. Specifically, the HFFT-PD model consists of a hierarchical transformer merging (HTM) block and a lightweight adaptive channel compression (LACC) module, in which the HTM is specially designed for the Transformer architecture to bridge the semantic gaps between features from different hierarchical blocks, and the LACC accounts for the significance of distinct channels in the ultimate classification features. In addition, a brand-new Patch Dilating strategy is uniquely designed for the Transformer paradigm, functioning as a reassembly operator predicated on patch features. Contrasting with conventional upsampling techniques, Patch Dilating facilitates upsampling without requiring supplementary information, while concurrently preserving the semantic content of local spatial structure. Extensive and rigorous experiments conducted on the UC Merced land-use dataset (UCM), aerial image dataset (AID), and NWPU-45 datasets, with training ratios of 80%, 50%, and 20%, respectively, demonstrate that our proposed HFFT-PD outperforms the baseline at least by 0.59%, 0.44%, and 0.99%, respectively, showcasing the significant superiority of our HFFT-PD over contemporary state-of-the-art methodologies.

KW - Feature fusion

KW - remote sensing (RS)

KW - scene classification

KW - transformer

UR - http://www.scopus.com/inward/record.url?scp=85177087829&partnerID=8YFLogxK

U2 - 10.1109/TGRS.2023.3331880

DO - 10.1109/TGRS.2023.3331880

M3 - Article

AN - SCOPUS:85177087829

SN - 0196-2892

VL - 61

SP - 1

EP - 16

JO - IEEE Transactions on Geoscience and Remote Sensing

JF - IEEE Transactions on Geoscience and Remote Sensing

M1 - 4410516

ER -
