RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing Data

Yang Zhan; Zhitong Xiong; Yuan Yuan

doi:10.1109/TGRS.2023.3250471

School of Artificial Intelligence, OPtics and Electronics

Research output: Contribution to journal › Article › peer-review

86 Scopus citations

Abstract

In this article, we introduce the task of visual grounding for remote sensing data (RSVG). RSVG aims to localize the referred objects in remote sensing (RS) images with the guidance of natural language. To retrieve rich information from RS imagery using natural language, many research tasks, such as RS image visual question answering, RS image captioning, and RS image-text retrieval, have been extensively investigated. However, object-level visual grounding on RS images is still underexplored. Thus, in this work, we construct a dataset and explore deep learning models for the RSVG task. Our contributions can be summarized as follows. First, we build a new large-scale benchmark for RSVG based on the detection in optical remote sensing (DIOR) dataset, termed DIOR-RSVG, to advance research on RSVG. This new dataset includes image/expression/box triplets for training and evaluating visual grounding models. Second, we benchmark a wide range of state-of-the-art (SOTA) natural image visual grounding methods on the constructed DIOR-RSVG dataset and provide analyses based on the results. Third, we propose a novel transformer-based multigranularity visual language fusion (MGVLF) module. Remotely sensed images usually exhibit large scale variations and cluttered backgrounds. To deal with the scale-variation problem, the MGVLF module takes advantage of multiscale visual features and multigranularity textual embeddings to learn more discriminative representations. To cope with the cluttered background problem, MGVLF adaptively filters irrelevant noise and enhances salient features. In this way, our proposed model can incorporate more effective multilevel and multimodal features to boost performance. This work can provide useful insights for developing better RSVG models.

Original language	English
Article number	5604513
Journal	IEEE Transactions on Geoscience and Remote Sensing
Volume	61
DOIs	https://doi.org/10.1109/TGRS.2023.3250471
State	Published - 2023

Keywords

Multigranularity visual language fusion (MGVLF)
transformer
visual grounding for remote sensing data (RSVG)

Cite this

@article{d37b674ae9944e54995be5f4cec5337f,

title = "RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing Data",

abstract = "In this article, we introduce the task of visual grounding for remote sensing data (RSVG). RSVG aims to localize the referred objects in remote sensing (RS) images with the guidance of natural language. To retrieve rich information from RS imagery using natural language, many research tasks, such as RS image visual question answering, RS image captioning, and RS image-text retrieval, have been investigated a lot. However, the object-level visual grounding on RS images is still underexplored. Thus, in this work, we propose to construct the dataset and explore deep learning models for the RSVG task. Specifically, our contributions can be summarized as follows. First, we build the new large-scale benchmark of RSVG based on detection in optical remote sensing (DIOR) dataset, termed DIOR-RSVG, to fully advance the research of RSVG. This new dataset includes image/expression/box triplets for training and evaluating visual grounding models. Second, we benchmark extensive state-of-the-art (SOTA) natural image visual grounding methods on the constructed DIOR-RSVG dataset, and some insightful analyses are provided based on the results. Third, a novel transformer-based multigranularity visual language fusion (MGVLF) module is proposed. Remotely sensed images are usually with large-scale variations and cluttered backgrounds. To deal with the scale-variation problem, the MGVLF module takes advantage of multiscale visual features and multigranularity textual embeddings to learn more discriminative representations. To cope with the cluttered background problem, MGVLF adaptively filters irrelevant noise and enhances salient features. In this way, our proposed model can incorporate more effective multilevel and multimodal features to boost performance. This work can provide useful insights for developing better RSVG models.",

keywords = "Multigranularity visual language fusion (MGVLF), transformer, visual grounding for remote sensing data (RSVG)",

author = "Yang Zhan and Zhitong Xiong and Yuan Yuan",

note = "Publisher Copyright: {\textcopyright} 1980-2012 IEEE.",

year = "2023",

doi = "10.1109/TGRS.2023.3250471",

language = "English",

volume = "61",

journal = "IEEE Transactions on Geoscience and Remote Sensing",

issn = "0196-2892",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - JOUR

T1 - RSVG

T2 - Exploring Data and Models for Visual Grounding on Remote Sensing Data

AU - Zhan, Yang

AU - Xiong, Zhitong

AU - Yuan, Yuan

PY - 2023

Y1 - 2023

N2 - In this article, we introduce the task of visual grounding for remote sensing data (RSVG). RSVG aims to localize the referred objects in remote sensing (RS) images with the guidance of natural language. To retrieve rich information from RS imagery using natural language, many research tasks, such as RS image visual question answering, RS image captioning, and RS image-text retrieval, have been investigated a lot. However, the object-level visual grounding on RS images is still underexplored. Thus, in this work, we propose to construct the dataset and explore deep learning models for the RSVG task. Specifically, our contributions can be summarized as follows. First, we build the new large-scale benchmark of RSVG based on detection in optical remote sensing (DIOR) dataset, termed DIOR-RSVG, to fully advance the research of RSVG. This new dataset includes image/expression/box triplets for training and evaluating visual grounding models. Second, we benchmark extensive state-of-the-art (SOTA) natural image visual grounding methods on the constructed DIOR-RSVG dataset, and some insightful analyses are provided based on the results. Third, a novel transformer-based multigranularity visual language fusion (MGVLF) module is proposed. Remotely sensed images are usually with large-scale variations and cluttered backgrounds. To deal with the scale-variation problem, the MGVLF module takes advantage of multiscale visual features and multigranularity textual embeddings to learn more discriminative representations. To cope with the cluttered background problem, MGVLF adaptively filters irrelevant noise and enhances salient features. In this way, our proposed model can incorporate more effective multilevel and multimodal features to boost performance. This work can provide useful insights for developing better RSVG models.

AB - In this article, we introduce the task of visual grounding for remote sensing data (RSVG). RSVG aims to localize the referred objects in remote sensing (RS) images with the guidance of natural language. To retrieve rich information from RS imagery using natural language, many research tasks, such as RS image visual question answering, RS image captioning, and RS image-text retrieval, have been investigated a lot. However, the object-level visual grounding on RS images is still underexplored. Thus, in this work, we propose to construct the dataset and explore deep learning models for the RSVG task. Specifically, our contributions can be summarized as follows. First, we build the new large-scale benchmark of RSVG based on detection in optical remote sensing (DIOR) dataset, termed DIOR-RSVG, to fully advance the research of RSVG. This new dataset includes image/expression/box triplets for training and evaluating visual grounding models. Second, we benchmark extensive state-of-the-art (SOTA) natural image visual grounding methods on the constructed DIOR-RSVG dataset, and some insightful analyses are provided based on the results. Third, a novel transformer-based multigranularity visual language fusion (MGVLF) module is proposed. Remotely sensed images are usually with large-scale variations and cluttered backgrounds. To deal with the scale-variation problem, the MGVLF module takes advantage of multiscale visual features and multigranularity textual embeddings to learn more discriminative representations. To cope with the cluttered background problem, MGVLF adaptively filters irrelevant noise and enhances salient features. In this way, our proposed model can incorporate more effective multilevel and multimodal features to boost performance. This work can provide useful insights for developing better RSVG models.

KW - Multigranularity visual language fusion (MGVLF)

KW - transformer

KW - visual grounding for remote sensing data (RSVG)

UR - http://www.scopus.com/inward/record.url?scp=85149397876&partnerID=8YFLogxK

U2 - 10.1109/TGRS.2023.3250471

DO - 10.1109/TGRS.2023.3250471

M3 - Article

AN - SCOPUS:85149397876

SN - 0196-2892

VL - 61

JO - IEEE Transactions on Geoscience and Remote Sensing

JF - IEEE Transactions on Geoscience and Remote Sensing

M1 - 5604513

ER -
