Single-Stream Extractor Network with Contrastive Pre-Training for Remote-Sensing Change Captioning

Research output: Contribution to journal › Article › peer-review

6 Citations (Scopus)

Abstract

Remote-sensing (RS) image change captioning (RSICC) is a visual semantic understanding task that has received increasing attention. Change captioning methods must understand the visual information of a pair of images, capture the most significant difference between them, and then describe it in natural language. Most existing methods focus on improving the difference feature encoder or the language decoder while ignoring the visual feature extractor. Current feature extractors suffer from several issues: 1) a domain gap between pre-training on single-temporal natural images and the downstream bitemporal RS task; 2) limited difference feature modeling in implicit single-stream networks; and 3) high computational costs caused by extracting features for each temporal-phase image in dual-stream extractors. To address these issues, we propose a single-stream extractor network (SEN). It consists of a single-stream extractor pre-trained on bitemporal RS images using contrastive learning, which mitigates both the domain gap and the high computational cost. Additionally, to improve feature modeling of difference information, we propose a shallow feature embedding (SFE) module and a cross-attention guided difference (CAGD) module, which enhance the representation of temporal features and extract difference features explicitly. Extensive experiments and visualizations demonstrate the effectiveness and advanced performance of SEN. The code and model weights are available at https://github.com/mrazhou/SEN.
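The core idea behind explicit difference extraction via cross-attention can be sketched as follows. This is a minimal illustrative reconstruction, not the paper's actual CAGD implementation: tokens from the first temporal image attend over tokens from the second, and the residual between the original features and the attention-aligned reconstruction serves as the explicit change feature. All function and variable names here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_difference(f_t1, f_t2):
    """Hypothetical sketch of cross-attention guided difference extraction.

    f_t1, f_t2: token features of the two temporal images, shape (N, D).
    Tokens of t1 query tokens of t2; the t2 content aligned to each t1
    token is subtracted from f_t1 to yield an explicit difference feature.
    """
    d = f_t1.shape[-1]
    attn = softmax(f_t1 @ f_t2.T / np.sqrt(d), axis=-1)  # (N, N) attention weights
    aligned = attn @ f_t2   # t2 features re-expressed at t1 token positions
    return f_t1 - aligned   # explicit difference feature, shape (N, D)

rng = np.random.default_rng(0)
f1 = rng.standard_normal((16, 32))
f2 = rng.standard_normal((16, 32))
diff = cross_attention_difference(f1, f2)
print(diff.shape)  # (16, 32)
```

Regions that are unchanged between the two dates are well reconstructed by attention, so their residual is small, while genuinely changed regions produce large residuals that the decoder can describe.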

Original language: English
Article number: 5624514
Pages (from-to): 1-14
Number of pages: 14
Journal: IEEE Transactions on Geoscience and Remote Sensing
Volume: 62
DOI
Publication status: Published - 2024
