TY - JOUR
T1 - Single-Stream Extractor Network with Contrastive Pre-Training for Remote-Sensing Change Captioning
AU - Zhou, Qing
AU - Gao, Junyu
AU - Yuan, Yuan
AU - Wang, Qi
N1 - Publisher Copyright:
© 1980-2012 IEEE.
PY - 2024
Y1 - 2024
N2 - Remote-sensing (RS) image change captioning (RSICC) is a visual semantic understanding task that has received increasing attention. The change captioning methods are required to understand the visual information of the images and capture the most significant difference between them, then describe it in natural language. Most existing methods mainly focus on improving the difference feature encoder or language decoder, while ignoring the visual feature extractor. The current feature extractors suffer from several issues, including: 1) domain gap between pre-training on single-temporal natural images and downstream bitemporal RS task; 2) limited difference feature modeling in the implicit single-stream network; and 3) high computational costs caused by extracting features for each temporal phase image under the dual-stream extractor. To address these issues, we propose a single-stream extractor network (SEN). It consists of a single-stream extractor pre-trained on bitemporal RS images using contrastive learning to mitigate the domain gap and high computational cost. Additionally, to improve feature modeling for difference information, we propose a shallow feature embedding (SFE) module and a cross-attention guided difference (CAGD) module, which enhance the representation of temporal features and extract the difference features explicitly. Extensive experiments and visualizations demonstrate the effectiveness and advanced performance of SEN. The code and model weights are available at https://github.com/mrazhou/SEN.
AB - Remote-sensing (RS) image change captioning (RSICC) is a visual semantic understanding task that has received increasing attention. The change captioning methods are required to understand the visual information of the images and capture the most significant difference between them, then describe it in natural language. Most existing methods mainly focus on improving the difference feature encoder or language decoder, while ignoring the visual feature extractor. The current feature extractors suffer from several issues, including: 1) domain gap between pre-training on single-temporal natural images and downstream bitemporal RS task; 2) limited difference feature modeling in the implicit single-stream network; and 3) high computational costs caused by extracting features for each temporal phase image under the dual-stream extractor. To address these issues, we propose a single-stream extractor network (SEN). It consists of a single-stream extractor pre-trained on bitemporal RS images using contrastive learning to mitigate the domain gap and high computational cost. Additionally, to improve feature modeling for difference information, we propose a shallow feature embedding (SFE) module and a cross-attention guided difference (CAGD) module, which enhance the representation of temporal features and extract the difference features explicitly. Extensive experiments and visualizations demonstrate the effectiveness and advanced performance of SEN. The code and model weights are available at https://github.com/mrazhou/SEN.
KW - Change captioning
KW - contrastive pre-training
KW - remote-sensing (RS) images
KW - single-stream
UR - http://www.scopus.com/inward/record.url?scp=85193257041&partnerID=8YFLogxK
U2 - 10.1109/TGRS.2024.3400966
DO - 10.1109/TGRS.2024.3400966
M3 - Article
AN - SCOPUS:85193257041
SN - 0196-2892
VL - 62
SP - 1
EP - 14
JO - IEEE Transactions on Geoscience and Remote Sensing
JF - IEEE Transactions on Geoscience and Remote Sensing
M1 - 5624514
ER -