Semantic-Spatial Collaborative Perception Network for Remote Sensing Image Captioning

Qi Wang; Zhigang Yang; Weiping Ni; Junzheng Wu; Qiang Li

doi:10.1109/TGRS.2024.3502805

Institute of Optoelectronics and Intelligence

Research output: Contribution to journal › Article › peer-review

4 Citations (Scopus)

Abstract

Image captioning is a fundamental vision-language task with wide-ranging applications in daily life. Existing methods often struggle to accurately interpret the semantic information in remote sensing images due to the complexity of their backgrounds. Target region masks can effectively reflect the shape characteristics of targets and their potential interrelationships. Therefore, incorporating and fully integrating these features can significantly improve the quality of generated captions. However, researchers are hindered by the lack of relevant datasets that contain corresponding object masks. This raises a natural question: how can object masks be efficiently introduced and utilized? In this article, we provide potential target masks for the publicly available remote sensing image captioning (RSIC) datasets, enabling models to utilize the regional features of targets for RSIC. We also propose a novel RSIC algorithm, abbreviated as S²CPNet, that combines regional positional features with fine-grained semantic information. To effectively capture semantic information from the image and positional relationships from the mask, semantic and spatial feature enhancement submodules are introduced at the ends of the respective encoder branches. Furthermore, a cross-view feature fusion module is designed to integrate regional features and semantic information efficiently. Then, a target recognition decoder is developed to enhance the model's ability to identify and describe critical targets in images. Finally, we improve the caption generation decoder by adaptively merging textual information with visual features to generate more accurate descriptions. Our model achieves satisfactory results on three RSIC datasets compared with existing methods. The related datasets and code will be open-sourced at https://github.com/CVer-Yang/SSCPNet.

Original language	English
Article number	5649912
Journal	IEEE Transactions on Geoscience and Remote Sensing
Volume	62
DOI	https://doi.org/10.1109/TGRS.2024.3502805
Publication status	Published - 2024

Cite this

@article{6206986c7f3d42719007053d1f4756c0,

title = "Semantic-Spatial Collaborative Perception Network for Remote Sensing Image Captioning",

abstract = "Image captioning is a fundamental vision-language task with wide-ranging applications in daily life. The existing methods often struggle to accurately interpret the semantic information in remote sensing images due to the complexity of backgrounds. Target region masks can effectively reflect the shape characteristics of targets and their potential interrelationships. Therefore, incorporating and fully integrating these features can significantly improve the quality of generated captions. However, researchers are hindered by the lack of relevant datasets that contain corresponding object masks. It is natural to ask the following: how to efficiently introduce and utilize object masks? In this article, we provide potential target masks for the publicly available remote sensing image caption (RSIC) datasets, enabling models to utilize the regional features of targets for RSIC. Meanwhile, a novel RSIC algorithm is proposed that combines regional positional features with fine-grained semantic information, abbreviated as S$^{2}$CPNet. To effectively capture the semantic information from image and position relationship from mask, respectively, the semantic and spatial feature enhancement submodules are introduced at the ends of encoder branches, respectively. Furthermore, the cross-view feature fusion module is designed to integrate regional features and semantic information efficiently. Then, a target recognition decoder is developed to enhance the ability of model to identify and describe critical targets in images. Finally, we improve the caption generation decoder by adaptively merging textual information with visual features to generate more accurate descriptions. Our model achieves satisfactory results on three RSIC datasets compared with the existing method. The related datasets and code will be open-sourced in https://github.com/CVer-Yang/SSCPNet.",

keywords = "Attention mechanism, cross view, image captioning, remote sensing",

author = "Qi Wang and Zhigang Yang and Weiping Ni and Junzheng Wu and Qiang Li",

note = "Publisher Copyright: {\textcopyright} 1980-2012 IEEE.",

year = "2024",

doi = "10.1109/TGRS.2024.3502805",

language = "English",

volume = "62",

journal = "IEEE Transactions on Geoscience and Remote Sensing",

issn = "0196-2892",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - JOUR

T1 - Semantic-Spatial Collaborative Perception Network for Remote Sensing Image Captioning

AU - Wang, Qi

AU - Yang, Zhigang

AU - Ni, Weiping

AU - Wu, Junzheng

AU - Li, Qiang

PY - 2024

Y1 - 2024

AB - Image captioning is a fundamental vision-language task with wide-ranging applications in daily life. The existing methods often struggle to accurately interpret the semantic information in remote sensing images due to the complexity of backgrounds. Target region masks can effectively reflect the shape characteristics of targets and their potential interrelationships. Therefore, incorporating and fully integrating these features can significantly improve the quality of generated captions. However, researchers are hindered by the lack of relevant datasets that contain corresponding object masks. It is natural to ask the following: how to efficiently introduce and utilize object masks? In this article, we provide potential target masks for the publicly available remote sensing image caption (RSIC) datasets, enabling models to utilize the regional features of targets for RSIC. Meanwhile, a novel RSIC algorithm is proposed that combines regional positional features with fine-grained semantic information, abbreviated as S²CPNet. To effectively capture the semantic information from image and position relationship from mask, respectively, the semantic and spatial feature enhancement submodules are introduced at the ends of encoder branches, respectively. Furthermore, the cross-view feature fusion module is designed to integrate regional features and semantic information efficiently. Then, a target recognition decoder is developed to enhance the ability of model to identify and describe critical targets in images. Finally, we improve the caption generation decoder by adaptively merging textual information with visual features to generate more accurate descriptions. Our model achieves satisfactory results on three RSIC datasets compared with the existing method. The related datasets and code will be open-sourced in https://github.com/CVer-Yang/SSCPNet.

KW - Attention mechanism

KW - cross view

KW - image captioning

KW - remote sensing

UR - http://www.scopus.com/inward/record.url?scp=85210148904&partnerID=8YFLogxK

U2 - 10.1109/TGRS.2024.3502805

DO - 10.1109/TGRS.2024.3502805

M3 - Article

AN - SCOPUS:85210148904

SN - 0196-2892

VL - 62

JO - IEEE Transactions on Geoscience and Remote Sensing

JF - IEEE Transactions on Geoscience and Remote Sensing

M1 - 5649912

ER -
