TY - JOUR
T1 - GLCM: Global-Local Captioning Model for Remote Sensing Image Captioning
AU - Wang, Qi
AU - Huang, Wei
AU - Zhang, Xueting
AU - Li, Xuelong
N1 - Publisher Copyright:
© 2013 IEEE.
PY - 2023/11/1
Y1 - 2023/11/1
N2 - Remote sensing image captioning (RSIC), which describes a remote sensing image with a semantically related sentence, has been a cross-modal challenge between computer vision and natural language processing. Among the visual features extracted from remote sensing images, global features provide the complete and comprehensive visual relevance of all the words of a sentence simultaneously, while local features emphasize the discrimination of these words individually. Therefore, not only are global features important for caption generation, but local features are also meaningful for making the words more discriminative. To make full use of the advantages of both global and local features, in this article we propose an attention-based global-local captioning model (GLCM) to obtain a global-local visual feature representation for RSIC. Based on the proposed GLCM, the correlation among all the generated words and the relation between each individual word and its most related local visual features can be visualized in a similarity-based manner, which provides more interpretability for RSIC. In extensive experiments, our method achieves comparable results on UCM-captions and superior results on Sydney-captions and RSICD, the largest RSIC dataset.
KW - Deep learning
KW - global-local captioning model (GLCM)
KW - image captioning
KW - remote sensing
UR - http://www.scopus.com/inward/record.url?scp=85144028633&partnerID=8YFLogxK
U2 - 10.1109/TCYB.2022.3222606
DO - 10.1109/TCYB.2022.3222606
M3 - Article
C2 - 36446004
AN - SCOPUS:85144028633
SN - 2168-2267
VL - 53
SP - 6910
EP - 6922
JO - IEEE Transactions on Cybernetics
JF - IEEE Transactions on Cybernetics
IS - 11
ER -