Abstract
Visual-Inertial Odometry (VIO), which fuses visual and inertial measurements for robust self-localization, has become a key technology for autonomous systems. However, traditional VIO methods suffer from heavy parameter tuning, high computational cost, and limited robustness in dynamic environments. To address these issues, we propose a lightweight VIO framework that integrates two core components: an efficient dynamic perception network and a cross-modal consistency enhancement module. The standard convolutions in FlowNet are replaced with a dynamic perception network that combines a two-stream feature generation module with a spatial-channel cooperative gating mechanism, capturing long-range spatial dependencies while maintaining high computational efficiency. Furthermore, a novel fusion module reduces latent discrepancies between the heterogeneous visual and inertial modalities through a learnable shared mechanism. By adaptively aligning inertial features with visual features, this module enhances cross-modal complementarity and improves overall localization accuracy. Extensive experiments on multiple benchmark datasets demonstrate that the proposed framework achieves state-of-the-art performance while maintaining low complexity. In particular, the method improves trajectory estimation precision by 61.6% over the FlowNet-based baseline on KITTI.
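The abstract describes the two components only at a high level. Below is a minimal PyTorch sketch of how a two-stream feature generator with spatial-channel cooperative gating, and a learnable cross-modal alignment module, might be realized; the module names (`TwoStreamGate`, `CrossModalAlign`), channel sizes, and the sigmoid-gated fusion rule are illustrative assumptions, not the paper's actual layer definitions.

```python
import torch
import torch.nn as nn

class TwoStreamGate(nn.Module):
    """Hypothetical sketch of a two-stream feature generation module
    followed by spatial-channel cooperative gating (assumed design,
    not the paper's exact architecture)."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        # Stream A: depthwise 3x3 conv capturing cheap local structure.
        self.local = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        # Stream B: pointwise 1x1 conv mixing information across channels.
        self.point = nn.Conv2d(channels, channels, 1)
        # Channel gate: global pooling + bottleneck (squeeze-and-excitation style).
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        # Spatial gate: 7x7 conv over pooled channel statistics.
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(2, 1, 7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x):
        feats = self.local(x) + self.point(x)            # two-stream generation
        c = self.channel_gate(feats)                     # per-channel weights
        stats = torch.cat([feats.mean(1, keepdim=True),  # avg- and max-pooled
                           feats.amax(1, keepdim=True)], dim=1)
        s = self.spatial_gate(stats)                     # per-pixel weights
        return feats * c * s                             # cooperative gating


class CrossModalAlign(nn.Module):
    """Hypothetical sketch of consistency-enhancing fusion: project both
    modalities into a shared latent space, then gate adaptively."""
    def __init__(self, dim_v: int, dim_i: int, dim_shared: int):
        super().__init__()
        self.proj_v = nn.Linear(dim_v, dim_shared)   # visual -> shared space
        self.proj_i = nn.Linear(dim_i, dim_shared)   # inertial -> shared space
        self.gate = nn.Sequential(nn.Linear(2 * dim_shared, dim_shared),
                                  nn.Sigmoid())

    def forward(self, f_v, f_i):
        v = self.proj_v(f_v)
        i = self.proj_i(f_i)
        g = self.gate(torch.cat([v, i], dim=-1))     # adaptive per-dim weights
        return g * v + (1 - g) * i                   # aligned, fused feature


if __name__ == "__main__":
    # Example shapes are assumptions chosen only for the demo.
    x = torch.randn(2, 64, 32, 32)
    print(TwoStreamGate(64)(x).shape)                     # (2, 64, 32, 32)
    f_v, f_i = torch.randn(2, 512), torch.randn(2, 256)
    print(CrossModalAlign(512, 256, 256)(f_v, f_i).shape) # (2, 256)
```

In this sketch the channel gate reweights feature maps globally while the spatial gate reweights them per pixel, so the two gates cooperate on complementary axes; the fusion gate interpolates between the projected visual and inertial features, which is one plausible reading of "adaptively aligning inertial features with visual features."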
| Original language | English |
|---|---|
| Article number | 112779 |
| Journal | Pattern Recognition |
| Volume | 173 |
| DOIs | |
| State | Published - May 2026 |
Keywords
- Consistency improvement fusion
- Deep learning
- Dynamic PerceptionNet
- Pose estimation
- Visual inertial odometry