TY - JOUR
T1 - SkyEyeGPT: Unifying remote sensing vision-language tasks via instruction tuning with large language model
AU - Zhan, Yang
AU - Xiong, Zhitong
AU - Yuan, Yuan
N1 - Publisher Copyright:
© 2025 The Authors
PY - 2025/3
Y1 - 2025/3
AB - Large language models (LLMs) have recently been extended to the vision-language realm, obtaining impressive general multi-modal capabilities. However, the exploration of multi-modal large language models (MLLMs) for remote sensing (RS) data is still in its infancy, lacking datasets and with unsatisfactory performance. In this work, we meticulously curate a large-scale RS multi-modal instruction tuning dataset, including single-task and multi-task conversation instructions. After manual verification, we obtain a high-quality RS instruction-following dataset with 968k samples, namely SkyEye-968k. To this end, we introduce SkyEyeGPT, a unified multi-modal large language model specifically designed for RS multi-granularity vision-language understanding. Our research demonstrates that with a simple yet effective design, SkyEyeGPT works surprisingly well on considerably different tasks without the need for extra encoding modules. Specifically, after projecting RS visual features to the language domain via an alignment layer, they are fed jointly with task-specific instructions into an LLM-based RS decoder to predict answers for RS open-ended tasks. In addition, we design a two-stage tuning method to enhance instruction-following and multi-turn dialogue ability at different granularities. Experiments on 8 datasets for RS vision-language tasks demonstrate SkyEyeGPT's superiority in image-level and region-level tasks, such as captioning and visual grounding. In particular, SkyEyeGPT exhibits encouraging results compared to GPT-4V in some qualitative tests. The online demo, code, and dataset will be released.
KW - Instruction tuning
KW - Large language model
KW - Multi-modal
KW - Remote sensing vision-language
UR - http://www.scopus.com/inward/record.url?scp=85216830090&partnerID=8YFLogxK
DO - 10.1016/j.isprsjprs.2025.01.020
M3 - Article
AN - SCOPUS:85216830090
SN - 0924-2716
VL - 221
SP - 64
EP - 77
JO - ISPRS Journal of Photogrammetry and Remote Sensing
JF - ISPRS Journal of Photogrammetry and Remote Sensing
ER -