TY - JOUR
T1 - Enhanced Window-Based Self-Attention with Global and Multi-Scale Representations for Remote Sensing Image Super-Resolution
AU - Lu, Yuting
AU - Wang, Shunzhou
AU - Wang, Binglu
AU - Zhang, Xin
AU - Wang, Xiaoxu
AU - Zhao, Yongqiang
N1 - Publisher Copyright:
© 2024 by the authors.
PY - 2024/8
Y1 - 2024/8
AB - Transformers have recently gained significant attention in low-level vision tasks, particularly for remote sensing image super-resolution (RSISR). The vanilla vision transformer aims to establish long-range dependencies between image patches, but its global receptive field causes computational complexity to grow quadratically with spatial size, making it inefficient for RSISR tasks that involve large images. To reduce computational cost, recent studies have explored local attention mechanisms, inspired by convolutional neural networks (CNNs), that restrict interactions to patches within small windows. However, these approaches are inherently limited by their small effective receptive fields, and their fixed window sizes prevent them from perceiving multi-scale information, which constrains model performance. To address these challenges, we propose a hierarchical transformer model, the Multi-Scale and Global Representation Enhancement-based Transformer (MSGFormer). We introduce an efficient attention mechanism, Dual Window-based Self-Attention (DWSA), which combines distributed and concentrated attention to balance computational complexity against receptive field range. In addition, we incorporate a Multi-scale Depth-wise Convolution Attention (MDCA) module, which captures multi-scale features through multi-branch convolution. Furthermore, we develop a new Tracing-Back Structure (TBS) that provides tracing-back mechanisms for both proposed attention modules to enhance their feature representation capability. Extensive experiments demonstrate that MSGFormer outperforms state-of-the-art methods on multiple public RSISR datasets by 0.11–0.55 dB.
KW - global receptive field
KW - multi-scale representation
KW - remote sensing image super-resolution
KW - transformer model
KW - window-based self-attention
UR - http://www.scopus.com/inward/record.url?scp=85200861955&partnerID=8YFLogxK
U2 - 10.3390/rs16152837
DO - 10.3390/rs16152837
M3 - Article
AN - SCOPUS:85200861955
SN - 2072-4292
VL - 16
JO - Remote Sensing
JF - Remote Sensing
IS - 15
M1 - 2837
ER -