Word-Sentence Framework for Remote Sensing Image Captioning

Qi Wang; Wei Huang; Xueting Zhang; Xuelong Li

doi:10.1109/TGRS.2020.3044054

Northwestern Polytechnical University Xian

Research output: Contribution to journal › Article › peer-review

87 Scopus citations

Abstract

Remote sensing image captioning (RSIC), which aims to generate a well-formed sentence describing a remote sensing image, has attracted increasing attention in recent years. The general framework for RSIC is the encoder-decoder architecture, which consists of two submodels, an encoder and a decoder. Although it achieves strong performance, the encoder-decoder architecture is a black-box model that lacks explainability. To overcome this drawback, in this article we propose a new explainable word-sentence framework for RSIC. The proposed framework consists of two parts, a word extractor and a sentence generator: the former extracts the valuable words from the given remote sensing image, while the latter organizes these words into a well-formed sentence. The framework thus decomposes RSIC into a word classification task and a word sorting task, which is more in line with human intuition. On the basis of the word-sentence framework, ablation experiments are conducted on three public RSIC data sets (Sydney-captions, UCM-captions, and RSICD) to explore specific and effective network structures. To evaluate the proposed framework objectively, we further conduct comparative experiments on these three data sets and achieve results comparable to those of encoder-decoder-based methods.
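
The two-stage decomposition described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the threshold, and every score below are hypothetical placeholders standing in for the outputs of trained networks. It only shows the shape of the idea, where a word classification step selects words and a word sorting step orders them.

```python
# Toy sketch of the word-sentence idea: pick words, then order them.
# All names and scores are made-up placeholders, not outputs of the authors' models.

def extract_words(word_scores, threshold=0.5):
    """Word classification step: keep each vocabulary word whose score passes the threshold."""
    return [word for word, score in word_scores.items() if score >= threshold]

def generate_sentence(words, position_scores):
    """Word sorting step: order the kept words by a per-word position score."""
    ordered = sorted(words, key=lambda word: position_scores[word])
    return " ".join(ordered)

# Hypothetical per-word confidences for one image (assumed, for illustration only).
word_scores = {"airport": 0.9, "planes": 0.8, "river": 0.1, "many": 0.7, "an": 0.6, "in": 0.75}
# Hypothetical ordering scores; lower means earlier in the sentence.
position_scores = {"many": 0, "planes": 1, "in": 2, "an": 3, "airport": 4, "river": 5}

words = extract_words(word_scores)
caption = generate_sentence(words, position_scores)
print(caption)
```

In this sketch the intermediate word list can be inspected directly, which is the kind of explainability the abstract contrasts with a black-box encoder-decoder. A realistic sentence generator would also need to handle repeated words and learn the ordering rather than read it from a fixed table.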

Original language	English
Pages (from-to)	10532-10543
Number of pages	12
Journal	IEEE Transactions on Geoscience and Remote Sensing
Volume	59
Issue number	12
DOIs	https://doi.org/10.1109/TGRS.2020.3044054
State	Published - 1 Dec 2021

Keywords

Deep learning
image captioning
remote sensing
word-sentence framework

Cite this

@article{ff050bb780894935885702989240ac39,

title = "Word-Sentence Framework for Remote Sensing Image Captioning",

abstract = "Remote sensing image captioning (RSIC), which aims at generating a well-formed sentence for a remote sensing image, has attracted more attention in recent years. The general framework for RSIC is the encoder-decoder architecture containing two submodels of encoder and decoder. Although the significant performance is obtained, the encoder-decoder architecture is a black-box model with a lack of explainability. To overcome this drawback, in this article, we propose a new explainable word-sentence framework for RSIC. The proposed word-sentence framework consists of two parts: word extractor and sentence generator, where the former extracts the valuable words in the given remote sensing image, while the latter organizes these words into a well-formed sentence. The proposed framework decomposes RSIC into a word classification task and a word sorting task, which is more in line with human intuitive understanding. On the basis of the word-sentence framework, some ablation experiments are conducted on the three public RSIC data sets of Sydney-captions, UCM-captions, and RSICD to explore the specific and effective network structures. In order to evaluate the proposed word-sentence framework objectively, we further conduct some comparative experiments on these three data sets and achieve comparable results in comparison with the encoder-decoder-based methods.",

keywords = "Deep learning, image captioning, remote sensing, word-sentence framework",

author = "Qi Wang and Wei Huang and Xueting Zhang and Xuelong Li",

note = "Publisher Copyright: {\textcopyright} 1980-2012 IEEE.",

year = "2021",

month = dec,

day = "1",

doi = "10.1109/TGRS.2020.3044054",

language = "English",

volume = "59",

pages = "10532--10543",

journal = "IEEE Transactions on Geoscience and Remote Sensing",

issn = "0196-2892",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

number = "12",

}

TY - JOUR

T1 - Word-Sentence Framework for Remote Sensing Image Captioning

AU - Wang, Qi

AU - Huang, Wei

AU - Zhang, Xueting

AU - Li, Xuelong

PY - 2021/12/1

Y1 - 2021/12/1

N2 - Remote sensing image captioning (RSIC), which aims at generating a well-formed sentence for a remote sensing image, has attracted more attention in recent years. The general framework for RSIC is the encoder-decoder architecture containing two submodels of encoder and decoder. Although the significant performance is obtained, the encoder-decoder architecture is a black-box model with a lack of explainability. To overcome this drawback, in this article, we propose a new explainable word-sentence framework for RSIC. The proposed word-sentence framework consists of two parts: word extractor and sentence generator, where the former extracts the valuable words in the given remote sensing image, while the latter organizes these words into a well-formed sentence. The proposed framework decomposes RSIC into a word classification task and a word sorting task, which is more in line with human intuitive understanding. On the basis of the word-sentence framework, some ablation experiments are conducted on the three public RSIC data sets of Sydney-captions, UCM-captions, and RSICD to explore the specific and effective network structures. In order to evaluate the proposed word-sentence framework objectively, we further conduct some comparative experiments on these three data sets and achieve comparable results in comparison with the encoder-decoder-based methods.

AB - Remote sensing image captioning (RSIC), which aims at generating a well-formed sentence for a remote sensing image, has attracted more attention in recent years. The general framework for RSIC is the encoder-decoder architecture containing two submodels of encoder and decoder. Although the significant performance is obtained, the encoder-decoder architecture is a black-box model with a lack of explainability. To overcome this drawback, in this article, we propose a new explainable word-sentence framework for RSIC. The proposed word-sentence framework consists of two parts: word extractor and sentence generator, where the former extracts the valuable words in the given remote sensing image, while the latter organizes these words into a well-formed sentence. The proposed framework decomposes RSIC into a word classification task and a word sorting task, which is more in line with human intuitive understanding. On the basis of the word-sentence framework, some ablation experiments are conducted on the three public RSIC data sets of Sydney-captions, UCM-captions, and RSICD to explore the specific and effective network structures. In order to evaluate the proposed word-sentence framework objectively, we further conduct some comparative experiments on these three data sets and achieve comparable results in comparison with the encoder-decoder-based methods.

KW - Deep learning

KW - image captioning

KW - remote sensing

KW - word-sentence framework

UR - http://www.scopus.com/inward/record.url?scp=85098769010&partnerID=8YFLogxK

U2 - 10.1109/TGRS.2020.3044054

DO - 10.1109/TGRS.2020.3044054

M3 - Article

AN - SCOPUS:85098769010

SN - 0196-2892

VL - 59

SP - 10532

EP - 10543

JO - IEEE Transactions on Geoscience and Remote Sensing

JF - IEEE Transactions on Geoscience and Remote Sensing

IS - 12

ER -
