Abstract
With the benefit of deep learning technology, generating captions for remote sensing images has become achievable, and great progress has been made in this field in recent years. However, the large scale variation of remote sensing images, which can lead to errors or omissions in feature extraction, still limits further improvement of caption quality. To address this problem, we propose a denoising-based multiscale feature fusion (DMSFF) mechanism for remote sensing image captioning in this letter. The proposed DMSFF mechanism aggregates multiscale features with a denoising operation at the stage of visual feature extraction. It helps the encoder-decoder framework, which is widely used in image captioning, obtain a denoised multiscale feature representation. In experiments, we apply the proposed DMSFF mechanism in the encoder-decoder framework and perform comparative experiments on two public remote sensing image captioning data sets, UC Merced (UCM)-captions and Sydney-captions. The experimental results demonstrate the effectiveness of our method.
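The abstract does not specify the concrete denoising operation or fusion scheme, so the following is only a minimal illustrative sketch of the general idea: denoise feature maps taken from several encoder stages, resize them to a common resolution, and fuse them into one representation for the decoder. The `DenoisingBlock`, the non-local-style smoothing inside it, the stage channel counts, and the concatenation-based fusion are all assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch of denoising-based multiscale feature fusion.
# All design choices below (denoising operator, fusion by concatenation,
# channel counts) are assumptions; the letter may differ in each.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DenoisingBlock(nn.Module):
    """Hypothetical denoising step: non-local-means-style smoothing
    with a residual connection (one common form of feature denoising)."""

    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        flat = x.flatten(2)                                  # (B, C, HW)
        sim = flat.transpose(1, 2) @ flat / c ** 0.5         # (B, HW, HW)
        attn = torch.softmax(sim, dim=-1)
        denoised = (flat @ attn.transpose(1, 2)).view(b, c, h, w)
        return x + self.proj(denoised)                       # residual


class DMSFF(nn.Module):
    """Denoise each scale, resize to the finest resolution, then fuse
    by concatenation and a 1x1 convolution (an assumed fusion scheme)."""

    def __init__(self, in_channels=(256, 512, 1024), out_channels=512):
        super().__init__()
        self.denoise = nn.ModuleList(DenoisingBlock(c) for c in in_channels)
        self.fuse = nn.Conv2d(sum(in_channels), out_channels, kernel_size=1)

    def forward(self, features):
        target = features[0].shape[-2:]  # finest spatial resolution
        cleaned = [
            F.interpolate(block(f), size=target,
                          mode="bilinear", align_corners=False)
            for block, f in zip(self.denoise, features)
        ]
        return self.fuse(torch.cat(cleaned, dim=1))


if __name__ == "__main__":
    # Fake multiscale features such as a CNN backbone might produce.
    feats = [torch.randn(1, 256, 28, 28),
             torch.randn(1, 512, 14, 14),
             torch.randn(1, 1024, 7, 7)]
    print(DMSFF()(feats).shape)  # torch.Size([1, 512, 28, 28])
```

In an encoder-decoder captioning pipeline, the fused map produced here would then be flattened and attended over by the caption decoder in place of a single-scale feature map.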
| Original language | English |
| --- | --- |
| Article number | 9057472 |
| Pages (from-to) | 436-440 |
| Number of pages | 5 |
| Journal | IEEE Geoscience and Remote Sensing Letters |
| Volume | 18 |
| Issue number | 3 |
| DOIs | |
| State | Published - Mar 2021 |
Keywords
- Deep learning
- encoder-decoder
- feature fusion
- image captioning
- multiscale
- remote sensing