MEET: A Million-Scale Dataset for Fine-Grained Geospatial Scene Classification With Zoom-Free Remote Sensing Imagery

Yansheng Li; Yuning Wu; Gong Cheng; Chao Tao; Bo Dang; Yu Wang; Jiahao Zhang; Chuge Zhang; Yiting Liu; Xu Tang; Jiayi Ma; Yongjun Zhang

doi:10.1109/JAS.2025.125324

School of Automation

Research output: Contribution to journal › Article › peer-review

2 Citations (Scopus)

Abstract

Accurate fine-grained geospatial scene classification using remote sensing imagery is essential for a wide range of applications. However, existing approaches often rely on manually zooming remote sensing images at different scales to create typical scene samples. This approach fails to adequately support the fixed-resolution image interpretation requirements in real-world scenarios. To address this limitation, we introduce the million-scale fine-grained geospatial scene classification dataset (MEET), which contains over 1.03 million zoom-free remote sensing scene samples, manually annotated into 80 fine-grained categories. In MEET, each scene sample follows a scene-in-scene layout, where the central scene serves as the reference, and auxiliary scenes provide crucial spatial context for fine-grained classification. Moreover, to tackle the emerging challenge of scene-in-scene classification, we present the context-aware transformer (CAT), a model specifically designed for this task. CAT adaptively fuses spatial context to accurately classify the scene samples by learning attentional features that capture the relationships between the central and auxiliary scenes. Based on MEET, we establish a comprehensive benchmark for fine-grained geospatial scene classification, evaluating CAT against 11 competitive baselines. The results demonstrate that CAT significantly outperforms these baselines, achieving a 1.88% higher balanced accuracy (BA) with the Swin-Large backbone, and a notable 7.87% improvement with the Swin-Huge backbone. Further experiments validate the effectiveness of each module in CAT and show the practical applicability of CAT in urban functional zone mapping.
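
The abstract reports its results as balanced accuracy (BA) but does not define the metric. The following is a minimal sketch, assuming BA follows the common definition (the unweighted mean of per-class recall), which suits datasets with uneven class sizes. The function name, the two category labels, and the toy predictions are illustrative assumptions and are not taken from the paper or from MEET.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Return the unweighted mean of per-class recall.

    Assumption: this is the usual definition of balanced accuracy;
    the abstract does not state the exact formula used in the paper.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("y_true must not be empty")

    total = defaultdict(int)    # samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1

    # Recall is computed per class, then averaged with equal class weight.
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical labels for illustration only (not MEET categories).
y_true = ["park", "park", "park", "park", "school", "school"]
y_pred = ["park", "park", "park", "school", "school", "park"]

ba = balanced_accuracy(y_true, y_pred)
plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"balanced accuracy = {ba:.3f}, plain accuracy = {plain:.3f}")
```

In this toy case the per-class recalls are 0.75 and 0.5, so BA is 0.625 while plain accuracy is about 0.667. The gap shows how BA keeps a smaller class from being masked by a larger one.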

Original language	English
Pages (from-to)	1004-1023
Number of pages	20
Journal	IEEE/CAA Journal of Automatica Sinica
Volume	12
Issue number	5
DOI	https://doi.org/10.1109/JAS.2025.125324
Publication status	Published - 2025

Cite this

@article{e59ec22885b7449f9702e46cf4610f24,

title = "MEET: A Million-Scale Dataset for Fine-Grained Geospatial Scene Classification With Zoom-Free Remote Sensing Imagery",

abstract = "Accurate fine-grained geospatial scene classification using remote sensing imagery is essential for a wide range of applications. However, existing approaches often rely on manually zooming remote sensing images at different scales to create typical scene samples. This approach fails to adequately support the fixed-resolution image interpretation requirements in real-world scenarios. To address this limitation, we introduce the million-scale fine-grained geospatial scene classification dataset (MEET), which contains over 1.03 million zoom-free remote sensing scene samples, manually annotated into 80 fine-grained categories. In MEET, each scene sample follows a scene-in-scene layout, where the central scene serves as the reference, and auxiliary scenes provide crucial spatial context for fine-grained classification. Moreover, to tackle the emerging challenge of scene-in-scene classification, we present the context-aware transformer (CAT), a model specifically designed for this task, which adaptively fuses spatial context to accurately classify the scene samples. CAT adaptively fuses spatial context to accurately classify the scene samples by learning attentional features that capture the relationships between the center and auxiliary scenes. Based on MEET, we establish a comprehensive benchmark for fine-grained geospatial scene classification, evaluating CAT against 11 competitive baselines. The results demonstrate that CAT significantly outperforms these baselines, achieving a 1.88% higher balanced accuracy (BA) with the Swin-Large backbone, and a notable 7.87% improvement with the Swin-Huge backbone. Further experiments validate the effectiveness of each module in CAT and show the practical applicability of CAT in the urban functional zone mapping.",

keywords = "Fine-grained geospatial scene classification (FGSC), million-scale dataset, remote sensing imagery (RSI), scene-in-scene, transformer",

author = "Yansheng Li and Yuning Wu and Gong Cheng and Chao Tao and Bo Dang and Yu Wang and Jiahao Zhang and Chuge Zhang and Yiting Liu and Xu Tang and Jiayi Ma and Yongjun Zhang",

note = "Publisher Copyright: {\textcopyright} 2014 Chinese Association of Automation.",

year = "2025",

doi = "10.1109/JAS.2025.125324",

language = "English",

volume = "12",

pages = "1004--1023",

journal = "IEEE/CAA Journal of Automatica Sinica",

issn = "2329-9266",

publisher = "IEEE Advancing Technology for Humanity",

number = "5",

}

TY - JOUR

T1 - MEET

T2 - A Million-Scale Dataset for Fine-Grained Geospatial Scene Classification With Zoom-Free Remote Sensing Imagery

AU - Li, Yansheng

AU - Wu, Yuning

AU - Cheng, Gong

AU - Tao, Chao

AU - Dang, Bo

AU - Wang, Yu

AU - Zhang, Jiahao

AU - Zhang, Chuge

AU - Liu, Yiting

AU - Tang, Xu

AU - Ma, Jiayi

AU - Zhang, Yongjun

PY - 2025

Y1 - 2025

AB - Accurate fine-grained geospatial scene classification using remote sensing imagery is essential for a wide range of applications. However, existing approaches often rely on manually zooming remote sensing images at different scales to create typical scene samples. This approach fails to adequately support the fixed-resolution image interpretation requirements in real-world scenarios. To address this limitation, we introduce the million-scale fine-grained geospatial scene classification dataset (MEET), which contains over 1.03 million zoom-free remote sensing scene samples, manually annotated into 80 fine-grained categories. In MEET, each scene sample follows a scene-in-scene layout, where the central scene serves as the reference, and auxiliary scenes provide crucial spatial context for fine-grained classification. Moreover, to tackle the emerging challenge of scene-in-scene classification, we present the context-aware transformer (CAT), a model specifically designed for this task, which adaptively fuses spatial context to accurately classify the scene samples. CAT adaptively fuses spatial context to accurately classify the scene samples by learning attentional features that capture the relationships between the center and auxiliary scenes. Based on MEET, we establish a comprehensive benchmark for fine-grained geospatial scene classification, evaluating CAT against 11 competitive baselines. The results demonstrate that CAT significantly outperforms these baselines, achieving a 1.88% higher balanced accuracy (BA) with the Swin-Large backbone, and a notable 7.87% improvement with the Swin-Huge backbone. Further experiments validate the effectiveness of each module in CAT and show the practical applicability of CAT in the urban functional zone mapping.

KW - Fine-grained geospatial scene classification (FGSC)

KW - million-scale dataset

KW - remote sensing imagery (RSI)

KW - scene-in-scene

KW - transformer

UR - http://www.scopus.com/inward/record.url?scp=105005277285&partnerID=8YFLogxK

U2 - 10.1109/JAS.2025.125324

DO - 10.1109/JAS.2025.125324

M3 - Article

AN - SCOPUS:105005277285

SN - 2329-9266

VL - 12

SP - 1004

EP - 1023

JO - IEEE/CAA Journal of Automatica Sinica

JF - IEEE/CAA Journal of Automatica Sinica

IS - 5

ER -
