Abstract
Remote sensing image captioning (RSIC) aims to describe the crucial objects in remote sensing images in natural language. The inefficient utilization of object texture and semantic features in images, together with ineffective cross-modal alignment between image and text features, are the primary factors that prevent models from generating high-quality captions. To address this issue, this article presents a network for RSIC, named HCNet, which incorporates hierarchical feature aggregation and cross-modal feature alignment. Specifically, a hierarchical feature aggregation module (HFAM) is proposed to obtain a comprehensive representation of visual features, which is beneficial for producing accurate descriptions. Considering the disparities between features of different modalities, we design a cross-modal feature interaction module (CFIM) in the decoder to facilitate feature alignment; it fully exploits cross-modal features to localize critical objects. In addition, a cross-modal feature alignment loss is introduced to align image and text features. Extensive experiments show that HCNet achieves satisfactory performance. In particular, it improves the CIDEr score by +14.15% on the NWPU dataset compared with existing approaches. The source code is publicly available at https://github.com/CVer-Yang/HCNet.
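The abstract does not give the formulation of the cross-modal feature alignment loss; a minimal sketch of one common way such an objective is realized, a symmetric contrastive loss between pooled image and text features, is shown below. The function name, the temperature value, and the assumption of pooled per-sample features are illustrative choices, not the authors' stated design.

```python
import torch
import torch.nn.functional as F

def cross_modal_alignment_loss(image_feats: torch.Tensor,
                               text_feats: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Hypothetical symmetric contrastive alignment between modalities.

    image_feats: (B, D) pooled visual features
    text_feats:  (B, D) pooled textual features
    """
    # Project both modalities onto the unit hypersphere.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)

    # Pairwise cosine similarities scaled by a temperature.
    logits = image_feats @ text_feats.t() / temperature

    # Matching image-text pairs lie on the diagonal of the similarity matrix.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy over image-to-text and text-to-image directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
```

In such a scheme, features of a matched image-caption pair are pulled together while mismatched pairs in the batch are pushed apart, which is one standard way to encourage the image-text alignment the abstract describes.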
| Field | Value |
|---|---|
| Original language | English |
| Article number | 5624711 |
| Pages (from-to) | 1-11 |
| Number of pages | 11 |
| Journal | IEEE Transactions on Geoscience and Remote Sensing |
| Volume | 62 |
| DOIs | |
| State | Published - 2024 |
Keywords
- Attention mechanism
- feature aggregation
- feature alignment
- image caption
- remote sensing