TY - JOUR
T1 - CDFNet
T2 - Cross-dimension fusion network with dual feature enhancement for multimodal object detection
AU - Wu, Wencong
AU - Zhang, Xiuwei
AU - Yin, Hanlin
AU - Zeng, Haorui
AU - Wei, Chenxu
AU - Yu, Lei
AU - Zhang, Yanning
N1 - Publisher Copyright: © 2026 Elsevier Ltd. All rights are reserved, including those for text and data mining, AI training, and similar technologies.
PY - 2026/8/1
Y1 - 2026/8/1
N2 - Multimodal object detection aims to exploit the complementarity between different modalities to improve detection results. However, most existing methods enhance cross-modal features only through the interaction of spatial information and neglect the interaction of channel information between modalities, which leaves cross-modal features insufficiently enhanced. Moreover, many detection models fuse multimodal features within a single feature dimension and do not use multi-dimensional information, so multimodal feature information is not fully exploited. To address these drawbacks, we propose a cross-dimension fusion network with dual feature enhancement (CDFNet) for visible and infrared object detection. Specifically, a dual feature enhancement module (DFEM) is designed to enhance cross-modal representations by modeling multiplicative interactions at both the spatial and channel levels. Furthermore, a cross-dimension feature fusion module (CDFFM) is developed to fully integrate the enhanced features by capturing dependencies across different dimensions, yielding a more discriminative fused feature. Extensive experiments demonstrate that the proposed CDFNet achieves a 1.8% higher mAP on the LLVIP dataset than the state-of-the-art detection method, and exhibits more competitive network complexity than Transformer-based and Mamba-based models. The code of CDFNet is released at https://github.com/WenCongWu/CDFNet.
KW - Multimodal object detection
KW - cross-dimension feature fusion
KW - feature enhancement
KW - feature interaction
UR - https://www.scopus.com/pages/publications/105035671611
U2 - 10.1016/j.eswa.2026.132380
DO - 10.1016/j.eswa.2026.132380
M3 - Article
AN - SCOPUS:105035671611
SN - 0957-4174
VL - 322
JO - Expert Systems with Applications
JF - Expert Systems with Applications
M1 - 132380
ER -