TY - JOUR
T1 - Development status and prospects of pretrained foundation models for remote sensing imagery
AU - Zhi, Yuanjie
AU - Jiang, Yiwei
AU - Yang, Zhi
AU - Chen, Yizhou
AU - Hao, Wenkui
AU - Ma, Mingyang
AU - Wei, Jiang
AU - Mei, Shaohui
N1 - Publisher Copyright:
© Journal of Image and Graphics. All rights reserved.
PY - 2026
Y1 - 2026
N2 - Given the continuous expansion of training datasets and the rapid evolution of deep learning architectures, vision foundation models and large language models have demonstrated remarkable generalization and adaptability across diverse downstream tasks, thereby drawing increasing attention from the research community. Within the domain of remote sensing (RS), data exhibit significant heterogeneity across multiple sources, modalities, spatial scales, and temporal dimensions. Designing pretrained RS foundation models (RSFMs) capable of effectively capturing such complex geospatial dependencies is critical for robust feature representation and intelligent interpretation of RS imagery. This paper presents a comprehensive review of the recent progress in pretraining strategies for RSFMs by emphasizing unimodal and multimodal learning paradigms. For unimodal models, representative frameworks based on self-supervised contrastive learning and masked image modeling are summarized. They leverage large-scale optical, hyperspectral, and radar imagery to learn transferable visual representations. These pretraining methods substantially enhance downstream performance in land cover classification, object detection, semantic segmentation, and change detection tasks. For multimodal models, we analyze the integration of image-text, image-location, and image-audio modalities through contrastive alignment strategies and cross-modal embedding learning, thereby effectively improving semantic coherence, generalizability, and interpretability in geospatial representation learning. Furthermore, widely adopted RS pretraining datasets, including their data sources, modality compositions, spatial resolutions, and annotation characteristics, are systematically summarized in this paper. Representative datasets, such as BigEarthNet, SEN12MS, and SkySenseGPT, are reviewed to demonstrate the diversity and scale of existing data resources. 
The importance of building open, standardized, and reproducible data repositories is emphasized, as such repositories serve as the foundation for training scalable and generalizable RSFMs. From a methodological perspective, this paper discusses the major pretraining paradigms that have shaped the current landscape of RSFMs, including contrastive self-supervised learning, generative self-supervised learning, and hybrid teacher-student distillation. These paradigms respectively aim to maximize representational consistency between augmented views, reconstruct masked information, and align intermediate features between models, thereby enabling the extraction of semantically rich and transferable geospatial features. Despite these advances, several challenges remain unresolved in the development of RSFMs. Data-related issues, such as the scarcity of well-annotated multimodal datasets, geographic and temporal imbalance, and high acquisition costs, continue to hinder large-scale model training. Model scalability poses another limitation, as billion-parameter architectures demand extensive computational resources and energy during training and inference. Moreover, current RSFMs often suffer from limited cross-domain and cross-sensor generalization, which leads to performance degradation when they are applied to new regions or modalities. Transparency and interpretability also remain pressing concerns, as understanding the internal mechanisms of deep RSFMs and improving their robustness against adversarial perturbations are essential for reliable real-world deployment. Future research may address these challenges by focusing on developing scalable multimodal architectures that can jointly process optical, synthetic aperture radar, hyperspectral, and textual data, as well as by designing lightweight RSFMs through model compression, sparse training, and modular architecture optimization.
Improving cross-domain and cross-temporal generalization by incorporating domain adaptation, meta-learning, and transfer learning techniques will further enhance model robustness under diverse acquisition conditions. In addition, integrating explainable artificial intelligence approaches, uncertainty quantification, and attention-based visualization can improve the interpretability and trustworthiness of RSFMs, thereby enabling their safe application in operational RS systems. Overall, this paper provides a systematic and forward-looking overview of the current development status, pretraining methodologies, benchmark datasets, and existing challenges of RSFMs. This work aims to offer a theoretical and methodological reference for the future construction of intelligent, scalable, and trustworthy foundation models in the RS domain by consolidating advances in unimodal and multimodal pretraining paradigms.
KW - general prediction
KW - multi-tasking
KW - multimodal foundation model
KW - pretrained foundation model
KW - remote sensing images
KW - remote sensing intelligent interpretation
UR - https://www.scopus.com/pages/publications/105035736811
U2 - 10.11834/jig.250424
DO - 10.11834/jig.250424
M3 - Article
AN - SCOPUS:105035736811
SN - 1006-8961
VL - 31
SP - 973
EP - 986
JO - Journal of Image and Graphics
JF - Journal of Image and Graphics
IS - 4
ER -