Development status and prospects of pretrained foundation models for remote sensing imagery

Translated title of the contribution: 面向遥感图像的预训练基础模型发展现状与展望
  • Yuanjie Zhi
  • Yiwei Jiang
  • Zhi Yang
  • Yizhou Chen
  • Wenkui Hao
  • Mingyang Ma
  • Jiang Wei
  • Shaohui Mei
  • Northwestern Polytechnical University Xian
  • State Grid Electric Power Research Institute Co., Ltd.

Research output: Contribution to journal, Article, peer-reviewed

Abstract

Given the continuous expansion of training datasets and the rapid evolution of deep learning architectures, vision foundation models and large language models have demonstrated remarkable generalization and adaptability across diverse downstream tasks, drawing increasing attention from the research community. Within the domain of remote sensing (RS), data exhibit significant heterogeneity across sources, modalities, spatial scales, and temporal dimensions. Designing pretrained RS foundation models (RSFMs) capable of effectively capturing such complex geospatial dependencies is critical for robust feature representation and intelligent interpretation of RS imagery. This paper presents a comprehensive review of recent progress in pretraining strategies for RSFMs, with an emphasis on unimodal and multimodal learning paradigms. For unimodal models, representative frameworks based on self-supervised contrastive learning and masked image modeling are summarized. These frameworks leverage large-scale optical, hyperspectral, and radar imagery to learn transferable visual representations, and they substantially enhance downstream performance in land cover classification, object detection, semantic segmentation, and change detection. For multimodal models, we analyze the integration of image-text, image-location, and image-audio modalities through contrastive alignment strategies and cross-modal embedding learning, which improves semantic coherence, generalizability, and interpretability in geospatial representation learning. Furthermore, widely adopted RS pretraining datasets are systematically summarized, including their data sources, modality compositions, spatial resolutions, and annotation characteristics. Representative datasets, such as BigEarthNet, SEN12MS, and SkySenseGPT, are reviewed to demonstrate the diversity and scale of existing data resources.
The importance of building open, standardized, and reproducible data repositories is emphasized, as these datasets serve as the foundation for training scalable and generalizable RSFMs. From a methodological perspective, this paper discusses the major pretraining paradigms that have shaped the current landscape of RSFMs, including contrastive self-supervised learning, generative self-supervised learning, and hybrid teacher-student distillation. These paradigms aim, respectively, to maximize representational consistency between augmented views, reconstruct masked information, and align intermediate features between models, enabling the extraction of semantically rich and transferable geospatial features. Despite these advances, several challenges remain unresolved in the development of RSFMs. Data-related issues, such as the scarcity of well-annotated multimodal datasets, geographic and temporal imbalance, and high acquisition costs, continue to hinder large-scale model training. Model scalability poses another limitation, as billion-parameter architectures demand extensive computational resources and energy during training and inference. Moreover, current RSFMs often suffer from limited cross-domain and cross-sensor generalization, leading to performance degradation when applied to new regions or modalities. Transparency and interpretability also remain pressing concerns, as understanding the internal mechanisms of deep RSFMs and improving their robustness against adversarial perturbations are essential for reliable real-world deployment. Future research may address these challenges by developing scalable multimodal architectures that can jointly process optical, synthetic aperture radar, hyperspectral, and textual data, and by designing lightweight RSFMs through model compression, sparse training, and modular architecture optimization.
Improving cross-domain and cross-temporal generalization by incorporating domain adaptation, meta-learning, and transfer learning techniques will further enhance model robustness under diverse acquisition conditions. In addition, integrating explainable artificial intelligence approaches, uncertainty quantification, and attention-based visualization can improve the interpretability and trustworthiness of RSFMs, thereby enabling their safe application in operational RS systems. Overall, this paper provides a systematic and forward-looking overview of the current development status, pretraining methodologies, benchmark datasets, and existing challenges of RSFMs. This work aims to offer a theoretical and methodological reference for the future construction of intelligent, scalable, and trustworthy foundation models in the RS domain by consolidating advances in unimodal and multimodal pretraining paradigms.
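
As an illustration of two of the pretraining objectives named in the abstract, the sketch below shows a contrastive loss over two augmented views and a reconstruction loss restricted to masked positions. It is a minimal standard-library sketch, not the formulation used by any model reviewed in the paper: the function names (`info_nce`, `masked_mse`), the temperature value, and the use of plain Python lists as embeddings are all assumptions made for illustration. Embeddings are assumed to be nonzero, and at least one position is assumed to be masked.

```python
import math


def l2_normalize(v):
    """Scale a nonzero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(view_a, view_b, temperature=0.1):
    """Mean InfoNCE loss over a batch (illustrative, hypothetical helper).

    Row i of view_a and row i of view_b are embeddings of two augmented
    views of the same image (the positive pair); all other rows of view_b
    act as negatives for row i of view_a.
    """
    a = [l2_normalize(v) for v in view_a]
    b = [l2_normalize(v) for v in view_b]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity to every candidate, scaled by the temperature.
        logits = [sum(x * y for x, y in zip(ai, bj)) / temperature for bj in b]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the positive pair.
        total += log_denom - logits[i]
    return total / len(a)


def masked_mse(original, reconstruction, mask):
    """Mean squared error over masked positions only (mask[i] is True
    where the value was hidden from the encoder)."""
    errors = [(o - r) ** 2
              for o, r, m in zip(original, reconstruction, mask) if m]
    return sum(errors) / len(errors)
```

With this sketch, a batch whose two views agree row by row yields a much lower contrastive loss than one whose rows are swapped, which is the "representational consistency between augmented views" that the contrastive paradigm optimizes. The masked loss ignores visible positions entirely, so only the hidden content drives reconstruction.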

Original language: English
Pages (from-to): 973-986
Number of pages: 14
Journal: Journal of Image and Graphics
Volume: 31
Issue number: 4
DOI
Publication status: Published - 2026

UN Sustainable Development Goals

This output contributes to the following Sustainable Development Goals:

  1. SDG 7 - Affordable and Clean Energy
