Implicit neural representation model for camera relocalization in multiple scenes

Shun Yao, Yongmei Cheng, Fei Yang, Mikhail G. Mozerov

Research output: Contribution to journal › Article › peer-review

Abstract

One critical task in remote sensing is determining the position of a video camera relative to the scene depicted in a series of images captured by the camera. Classical approaches often necessitate pre-built scene representations and the implementation of complex, time-consuming algorithms. Recent methods utilizing scene coordinate (SC) regression-based models have demonstrated promising performance in visual relocalization, in terms of both accuracy and efficiency, for a single scene. However, extending SC regression models to multiple scenes typically requires retraining model parameters or constructing reference landmarks, which is a time-consuming process. This paper proposes representing multiple scenes within a global reference coordinate system to efficiently train a single SC regression model in one training procedure. We encode scene information in scene embeddings as a prior condition for our model predictions. We design a scene-conditional regression-adjust (SCRA) module to adapt the model to the scene embedding by dynamically generating parameters during inference. Additionally, we employ modulation and complement modules to enhance the model's prediction applicability at both the image sample and scene levels. The modulation module adjusts the amplitude, phase, and frequency of the data flow for each input image, while the complement module derives scene-specific coordinate biases to reduce distribution differences between scenes. Extensive experiments on indoor and outdoor datasets validate our model's efficiency and accuracy in multi-scene visual relocalization. Compared to the state-of-the-art MS-Transformer model, our model requires less training time and achieves more accurate relocalization results, with reductions in average median errors of position and rotation by 50.0% and 52.0% on the Cambridge Landmarks dataset, and by 61.1% and 73.9% on the 7Scenes dataset. Compared to the separately trained advanced FeatLoc++Au model, our model achieves relative improvements in average median errors of position and rotation by 64.6% and 81.0% on the Cambridge Landmarks dataset, by 50.0% and 67.7% on the 7Scenes dataset, and by 73.7% and 41.5% on the 12Scenes dataset. We release our source code at https://github.com/AlcibiadesTophetScipio/SCINR.
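The abstract describes three scene-conditioned ingredients: a scene embedding, an SCRA-style module that dynamically generates modulation parameters (amplitude, frequency, phase), and a complement module that adds a scene-specific coordinate bias. The sketch below is a minimal, illustrative PyTorch rendering of that idea, not the authors' released implementation (see the SCINR repository linked above); the class and parameter names are assumptions introduced here for illustration only.

```python
# Illustrative sketch (not the authors' code) of scene-conditioned SC regression:
# a scene embedding drives (i) dynamically generated modulation of amplitude,
# frequency, and phase of sinusoidal features, and (ii) a scene-specific
# coordinate bias added to the regressed scene coordinates.
import torch
import torch.nn as nn


class SceneConditionedSCHead(nn.Module):
    """Toy scene-coordinate regression head conditioned on a scene embedding."""

    def __init__(self, num_scenes: int, feat_dim: int = 256, embed_dim: int = 64):
        super().__init__()
        self.scene_embed = nn.Embedding(num_scenes, embed_dim)

        # SCRA-like hypernetwork: generate per-scene modulation parameters
        # (amplitude, frequency, phase) from the scene embedding at inference time.
        self.to_modulation = nn.Linear(embed_dim, 3 * feat_dim)

        # Shared projection of per-pixel image features.
        self.proj = nn.Linear(feat_dim, feat_dim)

        # Regressor from modulated features to 3D scene coordinates.
        self.to_coord = nn.Linear(feat_dim, 3)

        # Complement-like scene-specific coordinate bias, absorbing
        # distribution differences between scenes in the global frame.
        self.to_bias = nn.Linear(embed_dim, 3)

    def forward(self, feats: torch.Tensor, scene_id: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, feat_dim) per-pixel features; scene_id: (B,) scene indices
        emb = self.scene_embed(scene_id)                        # (B, embed_dim)
        amp, freq, phase = self.to_modulation(emb).chunk(3, dim=-1)

        h = self.proj(feats)                                    # (B, N, feat_dim)
        # Sinusoidal modulation: the scene controls amplitude, frequency, phase.
        h = amp.unsqueeze(1) * torch.sin(freq.unsqueeze(1) * h + phase.unsqueeze(1))

        coords = self.to_coord(h)                               # (B, N, 3)
        return coords + self.to_bias(emb).unsqueeze(1)          # add scene bias


if __name__ == "__main__":
    head = SceneConditionedSCHead(num_scenes=7)
    feats = torch.randn(2, 1024, 256)      # dummy per-pixel features
    scene_id = torch.tensor([0, 3])        # two images from different scenes
    print(head(feats, scene_id).shape)     # torch.Size([2, 1024, 3])
```

In this reading, a single regression head serves all scenes: only the embedding (and the parameters generated from it) changes per scene, which is what allows one training procedure to cover multiple scenes in a shared global coordinate system.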

Original language: English
Article number: 111791
Journal: Pattern Recognition
Volume: 168
DOI
Publication status: Published - Dec 2025
