TY - JOUR
T1 - Truncation Cross Entropy Loss for Remote Sensing Image Captioning
AU - Li, Xuelong
AU - Zhang, Xueting
AU - Huang, Wei
AU - Wang, Qi
N1 - Publisher Copyright:
© 1980-2012 IEEE.
PY - 2021/6
Y1 - 2021/6
N2 - Recently, remote sensing image captioning (RSIC) has drawn increasing attention. In this field, encoder-decoder-based methods have become the mainstream due to their excellent performance. In the encoder-decoder framework, a convolutional neural network (CNN) is used to encode a remote sensing image into a semantic feature vector, and a sequence model such as long short-term memory (LSTM) is subsequently adopted to generate a content-related caption based on the feature vector. During the traditional training stage, the probability of the target word at each time step is forcibly optimized to 1 by the cross entropy (CE) loss. However, because of the variability and ambiguity of possible image captions, the target word could be replaced by other words such as its synonyms, and therefore such an optimization strategy can result in overfitting of the network. In this article, we explore the overfitting phenomenon in RSIC caused by CE loss and correspondingly propose a new truncation cross entropy (TCE) loss, aiming to alleviate the overfitting problem. To verify the effectiveness of the proposed approach, extensive comparison experiments are performed on three public RSIC data sets: UCM-captions, Sydney-captions, and RSICD. The state-of-the-art results on Sydney-captions and RSICD and the competitive results on UCM-captions achieved with TCE loss demonstrate that the proposed method is beneficial to RSIC.
KW - Image captioning
KW - overfitting
KW - remote sensing
KW - truncation cross entropy (TCE) loss
UR - http://www.scopus.com/inward/record.url?scp=85106669625&partnerID=8YFLogxK
U2 - 10.1109/TGRS.2020.3010106
DO - 10.1109/TGRS.2020.3010106
M3 - Article
AN - SCOPUS:85106669625
SN - 0196-2892
VL - 59
SP - 5246
EP - 5257
JO - IEEE Transactions on Geoscience and Remote Sensing
JF - IEEE Transactions on Geoscience and Remote Sensing
IS - 6
M1 - 9153154
ER -