Word-Sentence Framework for Remote Sensing Image Captioning

Qi Wang; Wei Huang; Xueting Zhang; Xuelong Li

doi:10.1109/TGRS.2020.3044054

Northwestern Polytechnical University Xian

Research output: Contribution to journal › Article › peer-review

86 Citations (Scopus)

Abstract

Remote sensing image captioning (RSIC), which aims at generating a well-formed sentence for a remote sensing image, has attracted more attention in recent years. The general framework for RSIC is the encoder-decoder architecture containing two submodels of encoder and decoder. Although the significant performance is obtained, the encoder-decoder architecture is a black-box model with a lack of explainability. To overcome this drawback, in this article, we propose a new explainable word-sentence framework for RSIC. The proposed word-sentence framework consists of two parts: word extractor and sentence generator, where the former extracts the valuable words in the given remote sensing image, while the latter organizes these words into a well-formed sentence. The proposed framework decomposes RSIC into a word classification task and a word sorting task, which is more in line with human intuitive understanding. On the basis of the word-sentence framework, some ablation experiments are conducted on the three public RSIC data sets of Sydney-captions, UCM-captions, and RSICD to explore the specific and effective network structures. In order to evaluate the proposed word-sentence framework objectively, we further conduct some comparative experiments on these three data sets and achieve comparable results in comparison with the encoder-decoder-based methods.
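
The two-stage decomposition described in the abstract can be illustrated with a toy sketch. This record does not describe the paper's actual networks, so everything below is an assumption made for illustration: the function names, the 0.5 threshold, the per-word scores, and the use of a predicted position score to order the words are hypothetical stand-ins for the word extractor (word classification) and the sentence generator (word sorting).

```python
from typing import Dict, List


def extract_words(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Toy word extractor: keep every vocabulary word whose score passes the threshold.

    This mimics a multi-label word classification step; the threshold is an assumption.
    """
    return [word for word, score in scores.items() if score >= threshold]


def generate_sentence(words: List[str], position: Dict[str, float]) -> str:
    """Toy sentence generator: order the extracted words by a position score.

    This mimics a word sorting step; the position scores are hypothetical.
    """
    ordered = sorted(words, key=lambda word: position[word])
    return " ".join(ordered)


# Hypothetical classifier scores for a single remote sensing image.
word_scores = {
    "airport": 0.91, "many": 0.72, "planes": 0.88, "river": 0.08,
    "parked": 0.64, "at": 0.55, "the": 0.60,
}
# Hypothetical position scores (lower means earlier in the sentence).
position_scores = {
    "many": 0, "planes": 1, "parked": 2, "at": 3,
    "the": 4, "airport": 5, "river": 6,
}

words = extract_words(word_scores)
caption = generate_sentence(words, position_scores)
print(caption)
```

The intermediate `words` list is what makes such a pipeline inspectable: a wrong caption can be traced either to a missing or spurious word (the classification stage) or to a bad ordering (the sorting stage).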

Original language	English
Pages (from-to)	10532-10543
Number of pages	12
Journal	IEEE Transactions on Geoscience and Remote Sensing
Volume	59
Issue number	12
DOI	https://doi.org/10.1109/TGRS.2020.3044054
Publication status	Published - Dec 1 2021

Cite this

@article{ff050bb780894935885702989240ac39,

title = "Word-Sentence Framework for Remote Sensing Image Captioning",

abstract = "Remote sensing image captioning (RSIC), which aims at generating a well-formed sentence for a remote sensing image, has attracted more attention in recent years. The general framework for RSIC is the encoder-decoder architecture containing two submodels of encoder and decoder. Although the significant performance is obtained, the encoder-decoder architecture is a black-box model with a lack of explainability. To overcome this drawback, in this article, we propose a new explainable word-sentence framework for RSIC. The proposed word-sentence framework consists of two parts: word extractor and sentence generator, where the former extracts the valuable words in the given remote sensing image, while the latter organizes these words into a well-formed sentence. The proposed framework decomposes RSIC into a word classification task and a word sorting task, which is more in line with human intuitive understanding. On the basis of the word-sentence framework, some ablation experiments are conducted on the three public RSIC data sets of Sydney-captions, UCM-captions, and RSICD to explore the specific and effective network structures. In order to evaluate the proposed word-sentence framework objectively, we further conduct some comparative experiments on these three data sets and achieve comparable results in comparison with the encoder-decoder-based methods.",

keywords = "Deep learning, image captioning, remote sensing, word-sentence framework",

author = "Qi Wang and Wei Huang and Xueting Zhang and Xuelong Li",

note = "Publisher Copyright: {\textcopyright} 1980-2012 IEEE.",

year = "2021",

month = dec,

day = "1",

doi = "10.1109/TGRS.2020.3044054",

language = "English",

volume = "59",

pages = "10532--10543",

journal = "IEEE Transactions on Geoscience and Remote Sensing",

issn = "0196-2892",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

number = "12",

}

TY - JOUR

T1 - Word-Sentence Framework for Remote Sensing Image Captioning

AU - Wang, Qi

AU - Huang, Wei

AU - Zhang, Xueting

AU - Li, Xuelong

PY - 2021/12/1

Y1 - 2021/12/1

N2 - Remote sensing image captioning (RSIC), which aims at generating a well-formed sentence for a remote sensing image, has attracted more attention in recent years. The general framework for RSIC is the encoder-decoder architecture containing two submodels of encoder and decoder. Although the significant performance is obtained, the encoder-decoder architecture is a black-box model with a lack of explainability. To overcome this drawback, in this article, we propose a new explainable word-sentence framework for RSIC. The proposed word-sentence framework consists of two parts: word extractor and sentence generator, where the former extracts the valuable words in the given remote sensing image, while the latter organizes these words into a well-formed sentence. The proposed framework decomposes RSIC into a word classification task and a word sorting task, which is more in line with human intuitive understanding. On the basis of the word-sentence framework, some ablation experiments are conducted on the three public RSIC data sets of Sydney-captions, UCM-captions, and RSICD to explore the specific and effective network structures. In order to evaluate the proposed word-sentence framework objectively, we further conduct some comparative experiments on these three data sets and achieve comparable results in comparison with the encoder-decoder-based methods.

AB - Remote sensing image captioning (RSIC), which aims at generating a well-formed sentence for a remote sensing image, has attracted more attention in recent years. The general framework for RSIC is the encoder-decoder architecture containing two submodels of encoder and decoder. Although the significant performance is obtained, the encoder-decoder architecture is a black-box model with a lack of explainability. To overcome this drawback, in this article, we propose a new explainable word-sentence framework for RSIC. The proposed word-sentence framework consists of two parts: word extractor and sentence generator, where the former extracts the valuable words in the given remote sensing image, while the latter organizes these words into a well-formed sentence. The proposed framework decomposes RSIC into a word classification task and a word sorting task, which is more in line with human intuitive understanding. On the basis of the word-sentence framework, some ablation experiments are conducted on the three public RSIC data sets of Sydney-captions, UCM-captions, and RSICD to explore the specific and effective network structures. In order to evaluate the proposed word-sentence framework objectively, we further conduct some comparative experiments on these three data sets and achieve comparable results in comparison with the encoder-decoder-based methods.

KW - Deep learning

KW - image captioning

KW - remote sensing

KW - word-sentence framework

UR - http://www.scopus.com/inward/record.url?scp=85098769010&partnerID=8YFLogxK

U2 - 10.1109/TGRS.2020.3044054

DO - 10.1109/TGRS.2020.3044054

M3 - Article

AN - SCOPUS:85098769010

SN - 0196-2892

VL - 59

SP - 10532

EP - 10543

JO - IEEE Transactions on Geoscience and Remote Sensing

JF - IEEE Transactions on Geoscience and Remote Sensing

IS - 12

ER -
