TY - JOUR
T1 - Collaborative Multimodal Fusion Network for Multiagent Perception
AU - Zhang, Lei
AU - Wang, Binglu
AU - Zhao, Yongqiang
AU - Yuan, Yuan
AU - Zhou, Tianfei
AU - Li, Zhijun
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - With the increasing popularity of autonomous driving systems and their applications in complex transportation scenarios, collaborative perception among multiple intelligent agents has become an important research direction. Existing single-agent multimodal fusion approaches are limited by their inability to leverage additional sensory data from nearby agents. In this article, we present the collaborative multimodal fusion network (CMMFNet) for distributed perception in multiagent systems. CMMFNet first extracts modality-specific features from LiDAR point clouds and camera images for each agent using dual-stream neural networks. To overcome the ambiguity in depth prediction, we introduce a collaborative depth supervision module that projects dense fused point clouds onto image planes to generate more accurate depth ground truths. We then present modality-aware fusion strategies to aggregate homogeneous features across agents while preserving their distinctive properties. To align heterogeneous LiDAR and camera features, we introduce a modality consistency learning method. Finally, a transformer-based fusion module dynamically captures cross-modal correlations to produce a unified representation. Comprehensive evaluations on two extensive multiagent perception datasets, OPV2V and V2XSet, affirm the superiority of CMMFNet in detection performance, establishing a new benchmark in the field.
AB - With the increasing popularity of autonomous driving systems and their applications in complex transportation scenarios, collaborative perception among multiple intelligent agents has become an important research direction. Existing single-agent multimodal fusion approaches are limited by their inability to leverage additional sensory data from nearby agents. In this article, we present the collaborative multimodal fusion network (CMMFNet) for distributed perception in multiagent systems. CMMFNet first extracts modality-specific features from LiDAR point clouds and camera images for each agent using dual-stream neural networks. To overcome the ambiguity in depth prediction, we introduce a collaborative depth supervision module that projects dense fused point clouds onto image planes to generate more accurate depth ground truths. We then present modality-aware fusion strategies to aggregate homogeneous features across agents while preserving their distinctive properties. To align heterogeneous LiDAR and camera features, we introduce a modality consistency learning method. Finally, a transformer-based fusion module dynamically captures cross-modal correlations to produce a unified representation. Comprehensive evaluations on two extensive multiagent perception datasets, OPV2V and V2XSet, affirm the superiority of CMMFNet in detection performance, establishing a new benchmark in the field.
KW - 3-D object detection
KW - autonomous driving
KW - collaborative perception
KW - multiagent system
KW - multimodal fusion
UR - http://www.scopus.com/inward/record.url?scp=85209947163&partnerID=8YFLogxK
U2 - 10.1109/TCYB.2024.3491756
DO - 10.1109/TCYB.2024.3491756
M3 - Article
AN - SCOPUS:85209947163
SN - 2168-2267
JO - IEEE Transactions on Cybernetics
JF - IEEE Transactions on Cybernetics
ER -