CMSE: Cross-Modal Semantic Enhancement Network for Classification of Hyperspectral and LiDAR Data

Wenqi Han; Wang Miao; Jie Geng; Wen Jiang

doi:10.1109/TGRS.2024.3368509

CMSE: Cross-Modal Semantic Enhancement Network for Classification of Hyperspectral and LiDAR Data

Wenqi Han, Wang Miao, Jie Geng, Wen Jiang

电子信息学院

Northwestern Polytechnical University Xian

科研成果: 期刊稿件 › 文章 › 同行评审

11 引用（Scopus）

摘要

The fusion of hyperspectral image (HSI) and light detection and ranging (LiDAR) data is widely used for land cover classification. However, due to different imaging mechanisms, HSI and LiDAR data always present significant image differences, and the dimensions and feature distributions of HSI and LiDAR are highly dissimilar. This makes it challenging to represent and correlate semantic information from multimodal data. Current methods for classifying pixel-by-pixel features, which rely on cascaded or attention-based fusion, cannot effectively use multimodal features. To achieve accurate classification results, extracting and fusing similar high-order semantic information and complementary discriminative information contained in multimodal data is vital. In this article, we propose a cross-modal semantic enhancement network (CMSE) for multimodal semantic information mining and fusion. Our proposed CMSE framework extracts features from the image on multiple scales, capturing more representative local sparse features with different sizes of convolution kernels. To represent high-level semantic features related to land cover, we establish a Gaussian-weighted matrix and semantically transform the spatial and spectral features of distinct branches. Finally, we build a multilevel residual fusion module to incrementally fuse spectral features from HSI and elevation features from LiDAR. Additionally, we introduce a cross-modal semantically constrained loss to guide multimodal semantic feature alignment. We evaluate our approach on three multimodal remote sensing (RS) datasets, namely the Houston2013, Trento, and MUUFL datasets. The experimental results demonstrate that our proposed CMSE model achieves superior performance in terms of accuracy and robustness compared to other related deep networks.

源语言	英语
文章编号	5509814
页（从-至）	1-14
页数	14
期刊	IEEE Transactions on Geoscience and Remote Sensing
卷	62
DOI	https://doi.org/10.1109/TGRS.2024.3368509
出版状态	已出版 - 2024

访问文件

10.1109/TGRS.2024.3368509

其它文件与链接

链接到 Scopus 的出版物

引用此

@article{b2ec2ab5da8448528b0c6363578782d8,

title = "CMSE: Cross-Modal Semantic Enhancement Network for Classification of Hyperspectral and LiDAR Data",

abstract = "The fusion of hyperspectral image (HSI) and light detection and ranging (LiDAR) data is widely used for land cover classification. However, due to different imaging mechanisms, HSI and LiDAR data always present significant image differences, and the dimensions and feature distributions of HSI and LiDAR are highly dissimilar. This makes it challenging to represent and correlate semantic information from multimodal data. Current methods for classifying pixel-by-pixel features, which rely on cascaded or attention-based fusion, cannot effectively use multimodal features. To achieve accurate classification results, extracting and fusing similar high-order semantic information and complementary discriminative information contained in multimodal data is vital. In this article, we propose a cross-modal semantic enhancement network (CMSE) for multimodal semantic information mining and fusion. Our proposed CMSE framework extracts features from the image on multiple scales, capturing more representative local sparse features with different sizes of convolution kernels. To represent high-level semantic features related to land cover, we establish a Gaussian-weighted matrix and semantically transform the spatial and spectral features of distinct branches. Finally, we build a multilevel residual fusion module to incrementally fuse spectral features from HSI and elevation features from LiDAR. Additionally, we introduce a cross-modal semantically constrained loss to guide multimodal semantic feature alignment. We evaluate our approach on three multimodal remote sensing (RS) datasets, namely the Houston2013, Trento, and MUUFL datasets. The experimental results demonstrate that our proposed CMSE model achieves superior performance in terms of accuracy and robustness compared to other related deep networks.",

keywords = "Classification, land cover, multimodal, remote sensing (RS), semantic features",

author = "Wenqi Han and Wang Miao and Jie Geng and Wen Jiang",

note = "Publisher Copyright: {\textcopyright} 1980-2012 IEEE.",

year = "2024",

doi = "10.1109/TGRS.2024.3368509",

language = "英语",

volume = "62",

pages = "1--14",

journal = "IEEE Transactions on Geoscience and Remote Sensing",

issn = "0196-2892",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - JOUR

T1 - CMSE

T2 - Cross-Modal Semantic Enhancement Network for Classification of Hyperspectral and LiDAR Data

AU - Han, Wenqi

AU - Miao, Wang

AU - Geng, Jie

AU - Jiang, Wen

PY - 2024

Y1 - 2024

N2 - The fusion of hyperspectral image (HSI) and light detection and ranging (LiDAR) data is widely used for land cover classification. However, due to different imaging mechanisms, HSI and LiDAR data always present significant image differences, and the dimensions and feature distributions of HSI and LiDAR are highly dissimilar. This makes it challenging to represent and correlate semantic information from multimodal data. Current methods for classifying pixel-by-pixel features, which rely on cascaded or attention-based fusion, cannot effectively use multimodal features. To achieve accurate classification results, extracting and fusing similar high-order semantic information and complementary discriminative information contained in multimodal data is vital. In this article, we propose a cross-modal semantic enhancement network (CMSE) for multimodal semantic information mining and fusion. Our proposed CMSE framework extracts features from the image on multiple scales, capturing more representative local sparse features with different sizes of convolution kernels. To represent high-level semantic features related to land cover, we establish a Gaussian-weighted matrix and semantically transform the spatial and spectral features of distinct branches. Finally, we build a multilevel residual fusion module to incrementally fuse spectral features from HSI and elevation features from LiDAR. Additionally, we introduce a cross-modal semantically constrained loss to guide multimodal semantic feature alignment. We evaluate our approach on three multimodal remote sensing (RS) datasets, namely the Houston2013, Trento, and MUUFL datasets. The experimental results demonstrate that our proposed CMSE model achieves superior performance in terms of accuracy and robustness compared to other related deep networks.

AB - The fusion of hyperspectral image (HSI) and light detection and ranging (LiDAR) data is widely used for land cover classification. However, due to different imaging mechanisms, HSI and LiDAR data always present significant image differences, and the dimensions and feature distributions of HSI and LiDAR are highly dissimilar. This makes it challenging to represent and correlate semantic information from multimodal data. Current methods for classifying pixel-by-pixel features, which rely on cascaded or attention-based fusion, cannot effectively use multimodal features. To achieve accurate classification results, extracting and fusing similar high-order semantic information and complementary discriminative information contained in multimodal data is vital. In this article, we propose a cross-modal semantic enhancement network (CMSE) for multimodal semantic information mining and fusion. Our proposed CMSE framework extracts features from the image on multiple scales, capturing more representative local sparse features with different sizes of convolution kernels. To represent high-level semantic features related to land cover, we establish a Gaussian-weighted matrix and semantically transform the spatial and spectral features of distinct branches. Finally, we build a multilevel residual fusion module to incrementally fuse spectral features from HSI and elevation features from LiDAR. Additionally, we introduce a cross-modal semantically constrained loss to guide multimodal semantic feature alignment. We evaluate our approach on three multimodal remote sensing (RS) datasets, namely the Houston2013, Trento, and MUUFL datasets. The experimental results demonstrate that our proposed CMSE model achieves superior performance in terms of accuracy and robustness compared to other related deep networks.

KW - Classification

KW - land cover

KW - multimodal

KW - remote sensing (RS)

KW - semantic features

UR - http://www.scopus.com/inward/record.url?scp=85186093258&partnerID=8YFLogxK

U2 - 10.1109/TGRS.2024.3368509

DO - 10.1109/TGRS.2024.3368509

M3 - 文章

AN - SCOPUS:85186093258

SN - 0196-2892

VL - 62

SP - 1

EP - 14

JO - IEEE Transactions on Geoscience and Remote Sensing

JF - IEEE Transactions on Geoscience and Remote Sensing

M1 - 5509814

ER -

CMSE: Cross-Modal Semantic Enhancement Network for Classification of Hyperspectral and LiDAR Data

摘要

访问文件

其它文件与链接

指纹

引用此