HCNet: Hierarchical Feature Aggregation and Cross-Modal Feature Alignment for Remote Sensing Image Captioning

Zhigang Yang, Qiang Li, Yuan Yuan, Qi Wang

Research output: Contribution to journal › Article › peer-review

13 Scopus citations

Abstract

Remote sensing image captioning (RSIC) aims to describe the crucial objects in remote sensing images in natural language. Two factors primarily limit a model's ability to generate high-quality captions: inefficient use of object texture and semantic features in images, and ineffective cross-modal alignment between image and text features. To address these issues, this article presents HCNet, a network for RSIC that combines hierarchical feature aggregation and cross-modal feature alignment. Specifically, a hierarchical feature aggregation module (HFAM) is proposed to obtain a comprehensive representation of visual features, which is beneficial for producing accurate descriptions. Considering the disparities between features of different modalities, we design a cross-modal feature interaction module (CFIM) in the decoder to facilitate feature alignment; it makes full use of cross-modal features to localize critical objects. In addition, a cross-modal feature alignment loss is introduced to align image and text features. Extensive experiments show that HCNet achieves satisfactory performance. In particular, we demonstrate a significant improvement of +14.15% in CIDEr score on NWPU datasets compared with existing approaches. The source code is publicly available at https://github.com/CVer-Yang/HCNet.
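
The abstract does not specify the form of the cross-modal feature alignment loss. As a rough illustration of the general idea only, the sketch below uses a symmetric contrastive (InfoNCE-style) objective, a common choice for aligning image and text embeddings. This is an assumption, not HCNet's actual loss; the function names, the `temperature` parameter, and the toy feature vectors are all hypothetical. The authors' repository linked above is the reference for the real implementation.

```python
import math

def l2_normalize(vec, eps=1e-8):
    """Scale a vector to (approximately) unit length."""
    norm = math.sqrt(sum(v * v for v in vec)) + eps
    return [v / norm for v in vec]

def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_sum for v in row]

def alignment_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric contrastive loss (an assumed stand-in, not the paper's loss).

    image_feats[i] and text_feats[i] are treated as a matching pair;
    every other combination in the batch acts as a negative.
    """
    img = [l2_normalize(v) for v in image_feats]
    txt = [l2_normalize(v) for v in text_feats]
    n = len(img)
    # Pairwise cosine similarities, scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(img[i], txt[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    logits_t = [list(col) for col in zip(*logits)]
    # Image-to-text and text-to-image cross-entropy on the matching index.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    t2i = -sum(log_softmax(logits_t[i])[i] for i in range(n)) / n
    return 0.5 * (i2t + t2i)

# Toy example: three hypothetical image/text embedding pairs.
feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned = alignment_loss(feats, feats)                      # matching pairs
shuffled = alignment_loss(feats, feats[1:] + feats[:1])     # mismatched pairs
```

In this toy setup the loss is lower when each image embedding is paired with its own text embedding than when the pairs are shuffled, which is the behaviour an alignment objective is meant to reward.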

Original language: English
Article number: 5624711
Pages (from-to): 1-11
Number of pages: 11
Journal: IEEE Transactions on Geoscience and Remote Sensing
Volume: 62
State: Published - 2024

Keywords

  • Attention mechanism
  • feature aggregation
  • feature alignment
  • image caption
  • remote sensing
