Word-Sentence Framework for Remote Sensing Image Captioning

Qi Wang, Wei Huang, Xueting Zhang, Xuelong Li

Research output: Contribution to journal › Article › peer-review

86 Scopus citations

Abstract

Remote sensing image captioning (RSIC), which aims to generate a well-formed sentence for a remote sensing image, has attracted increasing attention in recent years. The general framework for RSIC is the encoder-decoder architecture, which comprises two submodels: an encoder and a decoder. Although it achieves strong performance, the encoder-decoder architecture is a black-box model that lacks explainability. To overcome this drawback, in this article we propose a new explainable word-sentence framework for RSIC. The proposed framework consists of two parts, a word extractor and a sentence generator: the former extracts the valuable words from the given remote sensing image, while the latter organizes these words into a well-formed sentence. The framework thus decomposes RSIC into a word classification task and a word sorting task, which is more in line with human intuitive understanding. On the basis of the word-sentence framework, ablation experiments are conducted on three public RSIC data sets (Sydney-captions, UCM-captions, and RSICD) to explore specific and effective network structures. To evaluate the proposed framework objectively, we further conduct comparative experiments on these three data sets and achieve results comparable to those of encoder-decoder-based methods.
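
The two-stage decomposition described in the abstract (word classification, then word sorting) can be illustrated with a toy sketch. This is not the authors' network: the function names, the threshold, and all scores below are hypothetical stand-ins for what trained models would output, chosen only to show how the two tasks fit together.

```python
from typing import Dict, List


def extract_words(word_scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Word classification step (toy): keep each vocabulary word whose
    score reaches the threshold. Scores are assumed to come from some
    multi-label classifier applied to the image."""
    return [word for word, score in word_scores.items() if score >= threshold]


def generate_sentence(words: List[str], position_scores: Dict[str, float]) -> str:
    """Word sorting step (toy): order the extracted words by an assumed
    per-word position score and join them into a sentence."""
    ordered = sorted(words, key=lambda word: position_scores[word])
    return " ".join(ordered)


# Hypothetical classifier scores for one image (not real model output).
word_scores = {
    "airport": 0.91, "planes": 0.84, "river": 0.07,
    "many": 0.66, "an": 0.72, "at": 0.80, "forest": 0.12,
}
# Hypothetical position scores for the words that survive extraction.
position_scores = {"many": 0, "planes": 1, "at": 2, "an": 3, "airport": 4}

words = extract_words(word_scores)
caption = generate_sentence(words, position_scores)
print(caption)
```

Because the intermediate word list is visible, one can inspect which words were extracted before they are ordered, which is the kind of explainability the abstract contrasts with a black-box encoder-decoder. The toy ignores repeated words and learned ordering, which a real sentence generator would have to handle.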

Original language: English
Pages (from-to): 10532-10543
Number of pages: 12
Journal: IEEE Transactions on Geoscience and Remote Sensing
Volume: 59
Issue number: 12
State: Published - 1 Dec 2021

Keywords

  • Deep learning
  • image captioning
  • remote sensing
  • word-sentence framework
