CMSE: Cross-Modal Semantic Enhancement Network for Classification of Hyperspectral and LiDAR Data

Wenqi Han; Wang Miao; Jie Geng; Wen Jiang

doi:10.1109/TGRS.2024.3368509

School of Electronics and Information

Northwestern Polytechnical University Xian

Research output: Contribution to journal › Article › peer-review

11 Scopus citations

Abstract

The fusion of hyperspectral image (HSI) and light detection and ranging (LiDAR) data is widely used for land cover classification. However, due to different imaging mechanisms, HSI and LiDAR data always present significant image differences, and the dimensions and feature distributions of HSI and LiDAR are highly dissimilar. This makes it challenging to represent and correlate semantic information from multimodal data. Current methods for classifying pixel-by-pixel features, which rely on cascaded or attention-based fusion, cannot effectively use multimodal features. To achieve accurate classification results, extracting and fusing similar high-order semantic information and complementary discriminative information contained in multimodal data is vital. In this article, we propose a cross-modal semantic enhancement network (CMSE) for multimodal semantic information mining and fusion. Our proposed CMSE framework extracts features from the image on multiple scales, capturing more representative local sparse features with different sizes of convolution kernels. To represent high-level semantic features related to land cover, we establish a Gaussian-weighted matrix and semantically transform the spatial and spectral features of distinct branches. Finally, we build a multilevel residual fusion module to incrementally fuse spectral features from HSI and elevation features from LiDAR. Additionally, we introduce a cross-modal semantically constrained loss to guide multimodal semantic feature alignment. We evaluate our approach on three multimodal remote sensing (RS) datasets, namely the Houston2013, Trento, and MUUFL datasets. The experimental results demonstrate that our proposed CMSE model achieves superior performance in terms of accuracy and robustness compared to other related deep networks.

Original language	English
Article number	5509814
Pages (from-to)	1-14
Number of pages	14
Journal	IEEE Transactions on Geoscience and Remote Sensing
Volume	62
DOIs	https://doi.org/10.1109/TGRS.2024.3368509
State	Published - 2024

Keywords

Classification
land cover
multimodal
remote sensing (RS)
semantic features

Cite this

@article{b2ec2ab5da8448528b0c6363578782d8,
title = "CMSE: Cross-Modal Semantic Enhancement Network for Classification of Hyperspectral and LiDAR Data",

abstract = "The fusion of hyperspectral image (HSI) and light detection and ranging (LiDAR) data is widely used for land cover classification. However, due to different imaging mechanisms, HSI and LiDAR data always present significant image differences, and the dimensions and feature distributions of HSI and LiDAR are highly dissimilar. This makes it challenging to represent and correlate semantic information from multimodal data. Current methods for classifying pixel-by-pixel features, which rely on cascaded or attention-based fusion, cannot effectively use multimodal features. To achieve accurate classification results, extracting and fusing similar high-order semantic information and complementary discriminative information contained in multimodal data is vital. In this article, we propose a cross-modal semantic enhancement network (CMSE) for multimodal semantic information mining and fusion. Our proposed CMSE framework extracts features from the image on multiple scales, capturing more representative local sparse features with different sizes of convolution kernels. To represent high-level semantic features related to land cover, we establish a Gaussian-weighted matrix and semantically transform the spatial and spectral features of distinct branches. Finally, we build a multilevel residual fusion module to incrementally fuse spectral features from HSI and elevation features from LiDAR. Additionally, we introduce a cross-modal semantically constrained loss to guide multimodal semantic feature alignment. We evaluate our approach on three multimodal remote sensing (RS) datasets, namely the Houston2013, Trento, and MUUFL datasets. The experimental results demonstrate that our proposed CMSE model achieves superior performance in terms of accuracy and robustness compared to other related deep networks.",

keywords = "Classification, land cover, multimodal, remote sensing (RS), semantic features",
author = "Wenqi Han and Wang Miao and Jie Geng and Wen Jiang",
note = "Publisher Copyright: {\textcopyright} 1980-2012 IEEE.",
year = "2024",
doi = "10.1109/TGRS.2024.3368509",
language = "English",
volume = "62",
pages = "1--14",
journal = "IEEE Transactions on Geoscience and Remote Sensing",
issn = "0196-2892",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
}

TY - JOUR
T1 - CMSE
T2 - Cross-Modal Semantic Enhancement Network for Classification of Hyperspectral and LiDAR Data
AU - Han, Wenqi
AU - Miao, Wang
AU - Geng, Jie
AU - Jiang, Wen
PY - 2024
Y1 - 2024

AB - The fusion of hyperspectral image (HSI) and light detection and ranging (LiDAR) data is widely used for land cover classification. However, due to different imaging mechanisms, HSI and LiDAR data always present significant image differences, and the dimensions and feature distributions of HSI and LiDAR are highly dissimilar. This makes it challenging to represent and correlate semantic information from multimodal data. Current methods for classifying pixel-by-pixel features, which rely on cascaded or attention-based fusion, cannot effectively use multimodal features. To achieve accurate classification results, extracting and fusing similar high-order semantic information and complementary discriminative information contained in multimodal data is vital. In this article, we propose a cross-modal semantic enhancement network (CMSE) for multimodal semantic information mining and fusion. Our proposed CMSE framework extracts features from the image on multiple scales, capturing more representative local sparse features with different sizes of convolution kernels. To represent high-level semantic features related to land cover, we establish a Gaussian-weighted matrix and semantically transform the spatial and spectral features of distinct branches. Finally, we build a multilevel residual fusion module to incrementally fuse spectral features from HSI and elevation features from LiDAR. Additionally, we introduce a cross-modal semantically constrained loss to guide multimodal semantic feature alignment. We evaluate our approach on three multimodal remote sensing (RS) datasets, namely the Houston2013, Trento, and MUUFL datasets. The experimental results demonstrate that our proposed CMSE model achieves superior performance in terms of accuracy and robustness compared to other related deep networks.

KW - Classification
KW - land cover
KW - multimodal
KW - remote sensing (RS)
KW - semantic features
UR - http://www.scopus.com/inward/record.url?scp=85186093258&partnerID=8YFLogxK
U2 - 10.1109/TGRS.2024.3368509
DO - 10.1109/TGRS.2024.3368509
M3 - Article
AN - SCOPUS:85186093258
SN - 0196-2892
VL - 62
SP - 1
EP - 14
JO - IEEE Transactions on Geoscience and Remote Sensing
JF - IEEE Transactions on Geoscience and Remote Sensing
M1 - 5509814
ER -
