TY - JOUR
T1 - Constructing a Multi-Modal based Underwater Acoustic Target Recognition Method with a Pre-trained Language-Audio Model
AU - Fu, Bowen
AU - Nie, Jiangtao
AU - Wei, Wei
AU - Zhang, Lei
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
AB - Underwater Acoustic Target Recognition (UATR) aims to accurately identify the radiated acoustic signals of ships in complex maritime environments. The challenge of this task lies in extracting discriminative representations from complex and limited acoustic samples. Recently, various deep learning-based UATR methods have been proposed, but their performance on real sonar-collected signals remains limited. On one hand, most current methods adopt different strategies to extract features from acoustic signals, such as time-frequency, waveform, and joint representations; however, limited representation capability and simplistic fusion strategies often constrain recognition performance. On the other hand, they often overlook the knowledge gains offered by pre-trained models and the semantic correlations among multiple features, leading to unsatisfactory performance and even overfitting. To mitigate these issues, this paper proposes a Multi-Feature Underwater Acoustic Target Recognition method (MF-UATR). It introduces a highly generalizable multi-modal pre-trained language-audio model and a contrastive-learning-based feature-level fusion strategy to semantically guide and fuse multiple features. This strategy helps the model learn prior knowledge and the semantic correlations between features, thereby improving recognition performance. Additionally, we consider few-shot scenarios with extremely limited data, for which a Multi-Modal Few-Shot Underwater Acoustic Target Recognition (MMFS-UATR) scheme is proposed. It efficiently accomplishes few-shot underwater acoustic target recognition by combining parameter-efficient fine-tuning techniques, a semantic supervision strategy, and the pre-trained MF-UATR. Extensive experiments on two public datasets, DeepShip and ShipsEar, demonstrate that the proposed frameworks achieve superior recognition performance under both regular and few-shot settings.
KW - few-shot learning
KW - language-audio models
KW - multi-feature fusion
KW - underwater acoustic target recognition
UR - http://www.scopus.com/inward/record.url?scp=85211741744&partnerID=8YFLogxK
U2 - 10.1109/TGRS.2024.3515171
DO - 10.1109/TGRS.2024.3515171
M3 - Article
AN - SCOPUS:85211741744
SN - 0196-2892
JO - IEEE Transactions on Geoscience and Remote Sensing
JF - IEEE Transactions on Geoscience and Remote Sensing
ER -