Collaborative Multimodal Fusion Network for Multiagent Perception

Lei Zhang, Binglu Wang, Yongqiang Zhao, Yuan Yuan, Tianfei Zhou, Zhijun Li

Research output: Contribution to journal › Article › peer-review

Abstract

With the increasing popularity of autonomous driving systems and their applications in complex transportation scenarios, collaborative perception among multiple intelligent agents has become an important research direction. Existing single-agent multimodal fusion approaches are limited by their inability to leverage additional sensory data from nearby agents. In this article, we present the collaborative multimodal fusion network (CMMFNet) for distributed perception in multiagent systems. CMMFNet first extracts modality-specific features from LiDAR point clouds and camera images for each agent using dual-stream neural networks. To overcome the ambiguity in depth prediction, we introduce a collaborative depth supervision module that projects dense fused point clouds onto image planes to generate more accurate depth ground truths. We then present modality-aware fusion strategies to aggregate homogeneous features across agents while preserving their distinctive properties. To align heterogeneous LiDAR and camera features, we introduce a modality consistency learning method. Finally, a transformer-based fusion module dynamically captures cross-modal correlations to produce a unified representation. Comprehensive evaluations on two extensive multiagent perception datasets, OPV2V and V2XSet, affirm the superiority of CMMFNet in detection performance, establishing a new benchmark in the field.
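The abstract gives no implementation details, so the sketch below is only a rough illustration of what the final transformer-based fusion step could look like: a cross-attention block in which flattened LiDAR BEV features query camera BEV features to capture cross-modal correlations. All names and hyperparameters (CrossModalFusion, dim, num_heads, the feature shapes) are hypothetical and not taken from the paper.

```python
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Hypothetical cross-attention fusion block (not the paper's code):
    LiDAR tokens attend to camera tokens to form a unified representation."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * 4),
            nn.ReLU(inplace=True),
            nn.Linear(dim * 4, dim),
        )

    def forward(self, lidar_feat: torch.Tensor, camera_feat: torch.Tensor) -> torch.Tensor:
        # lidar_feat, camera_feat: (B, H*W, C) flattened BEV feature maps.
        # LiDAR tokens query camera tokens; attention weights model the
        # correlations between the two sensor streams.
        attn_out, _ = self.cross_attn(query=lidar_feat, key=camera_feat, value=camera_feat)
        fused = self.norm1(lidar_feat + attn_out)
        fused = self.norm2(fused + self.ffn(fused))
        return fused


if __name__ == "__main__":
    B, H, W, C = 2, 32, 32, 256
    lidar = torch.randn(B, H * W, C)
    camera = torch.randn(B, H * W, C)
    print(CrossModalFusion(dim=C)(lidar, camera).shape)  # torch.Size([2, 1024, 256])
```

This is a minimal single-block sketch; the actual CMMFNet module presumably operates on features already aggregated across agents and may stack several such layers.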

Original language: English
Journal: IEEE Transactions on Cybernetics
DOI
Publication status: Accepted/In press - 2024
