Exploring Multi-Level Attention and Semantic Relationship for Remote Sensing Image Captioning

Zhenghang Yuan, Xuelong Li, Qi Wang

Research output: Contribution to journal › Article › peer-review

45 Scopus citations

Abstract

Remote sensing image captioning, which aims to understand high-level semantic information and the interactions of different ground objects, has emerged as a new research topic in recent years. Although image captioning has developed rapidly with convolutional neural networks (CNNs) and recurrent neural networks (RNNs), captioning remote sensing images still suffers from two main limitations. The first is that the scales of objects in remote sensing images vary dramatically, which makes it difficult to obtain an effective image representation. The second is that visual relationships in remote sensing images remain underused, even though they have great potential to improve the final performance. To address these two limitations, this paper proposes an effective framework for remote sensing image captioning based on multi-level attention and multi-label attribute graph convolution. Specifically, the proposed multi-level attention module can adaptively focus not only on specific spatial features but also on features of specific scales. Moreover, the designed attribute graph convolution module employs the attribute graph to learn more effective attribute features for image captioning. Extensive experiments show that the proposed method achieves superior performance on the UCM-captions, Sydney-captions, and RSICD datasets.
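To make the two modules described in the abstract concrete, below is a minimal PyTorch sketch, not the authors' exact design: all module names, dimensions, and the attribute co-occurrence adjacency are illustrative assumptions. It shows a multi-level attention block that attends over spatial positions within each CNN scale and then over the scales themselves, and a single graph-convolution layer that propagates attribute embeddings over an attribute graph.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLevelAttention(nn.Module):
    """Attend over spatial positions within each CNN scale, then over scales."""
    def __init__(self, feat_dim):
        super().__init__()
        self.spatial_score = nn.Linear(feat_dim, 1)  # score per spatial position
        self.scale_score = nn.Linear(feat_dim, 1)    # score per scale

    def forward(self, feats):
        # feats: list of (batch, num_positions_i, feat_dim) tensors, one per
        # CNN stage, assumed already projected to a common feat_dim.
        per_scale = []
        for f in feats:
            w = F.softmax(self.spatial_score(f), dim=1)    # (B, N_i, 1)
            per_scale.append((w * f).sum(dim=1))           # (B, feat_dim)
        scales = torch.stack(per_scale, dim=1)             # (B, S, feat_dim)
        s = F.softmax(self.scale_score(scales), dim=1)     # (B, S, 1)
        return (s * scales).sum(dim=1)                     # (B, feat_dim)

class AttributeGCN(nn.Module):
    """One graph-convolution layer over attribute nodes: H' = ReLU(A_hat H W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, node_feats, adj_norm):
        # node_feats: (num_attributes, in_dim) attribute word embeddings
        # adj_norm: (num_attributes, num_attributes) normalized adjacency,
        # e.g. built from attribute co-occurrence statistics (assumption).
        return F.relu(self.proj(adj_norm @ node_feats))

# Toy usage: two CNN scales (7x7 and 14x14 grids) and a 20-attribute graph.
feats = [torch.randn(2, 49, 512), torch.randn(2, 196, 512)]
img_vec = MultiLevelAttention(512)(feats)                  # (2, 512)
attr_feats = AttributeGCN(300, 512)(torch.randn(20, 300), torch.eye(20))
```

In the full framework, representations like these would jointly condition an RNN decoder that generates the caption word by word.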

Original language: English
Article number: 8943170
Pages (from-to): 2608-2620
Number of pages: 13
Journal: IEEE Access
Volume: 8
DOIs
State: Published - 2020

Keywords

  • deep learning
  • graph convolutional networks (GCNs)
  • image captioning
  • remote sensing image
  • semantic understanding
