TY - JOUR
T1 - Truncation Cross Entropy Loss for Remote Sensing Image Captioning
AU - Li, Xuelong
AU - Zhang, Xueting
AU - Huang, Wei
AU - Wang, Qi
N1 - Publisher Copyright:
© 1980-2012 IEEE.
PY - 2021/6
Y1 - 2021/6
N2 - Recently, remote sensing image captioning (RSIC) has drawn increasing attention. In this field, encoder-decoder-based methods have become the mainstream due to their excellent performance. In the encoder-decoder framework, a convolutional neural network (CNN) is used to encode a remote sensing image into a semantic feature vector, and a sequence model such as long short-term memory (LSTM) is subsequently adopted to generate a content-related caption based on the feature vector. During the traditional training stage, the probability of the target word at each time step is forcibly optimized to 1 by the cross entropy (CE) loss. However, because of the variability and ambiguity of possible image captions, the target word could be replaced by other words such as its synonyms, and therefore such an optimization strategy can result in overfitting of the network. In this article, we explore the overfitting phenomenon in RSIC caused by CE loss and correspondingly propose a new truncation cross entropy (TCE) loss, aiming to alleviate the overfitting problem. To verify the effectiveness of the proposed approach, extensive comparison experiments are performed on three public RSIC data sets: UCM-captions, Sydney-captions, and RSICD. The state-of-the-art results on Sydney-captions and RSICD and the competitive results on UCM-captions achieved with TCE loss demonstrate that the proposed method is beneficial to RSIC.
KW - Image captioning
KW - overfitting
KW - remote sensing
KW - truncation cross entropy (TCE) loss
UR - http://www.scopus.com/inward/record.url?scp=85106669625&partnerID=8YFLogxK
U2 - 10.1109/TGRS.2020.3010106
DO - 10.1109/TGRS.2020.3010106
M3 - Article
AN - SCOPUS:85106669625
SN - 0196-2892
VL - 59
SP - 5246
EP - 5257
JO - IEEE Transactions on Geoscience and Remote Sensing
JF - IEEE Transactions on Geoscience and Remote Sensing
IS - 6
M1 - 9153154
ER -