Implicit neural representation model for camera relocalization in multiple scenes

Shun Yao, Yongmei Cheng, Fei Yang, Mikhail G. Mozerov

Research output: Contribution to journal › Article › peer-review

Abstract

One critical task in remote sensing is determining the position of a video camera relative to the scene depicted in a series of images captured by that camera. Classical approaches often require pre-built scene representations and complex, time-consuming algorithms. Recent methods based on scene coordinate (SC) regression have demonstrated promising accuracy and efficiency in visual relocalization for a single scene. However, extending SC regression models to multiple scenes typically requires retraining model parameters or constructing reference landmarks, both time-consuming processes. This paper proposes representing multiple scenes within a global reference coordinate system so that a single SC regression model can be trained efficiently in one training procedure. We encode scene information in scene embeddings that serve as a prior condition for the model's predictions. We design a scene-conditional regression-adjust (SCRA) module that adapts the model to the scene embedding by dynamically generating parameters during inference. Additionally, we employ modulation and complement modules to improve the model's predictions at both the image and scene levels: the modulation module adjusts the amplitude, phase, and frequency of the data flow for each input image, while the complement module derives scene-specific coordinate biases to reduce distribution differences between scenes. Extensive experiments on indoor and outdoor datasets validate our model's efficiency and accuracy in multi-scene visual relocalization. Compared to the state-of-the-art MS-Transformer model, our model requires less training time and achieves more accurate relocalization, reducing the average median position and rotation errors by 50.0% and 52.0% on the Cambridge Landmarks dataset and by 61.1% and 73.9% on the 7Scenes dataset. Compared to the separately trained FeatLoc++Au model, our model improves the average median position and rotation errors by 64.6% and 81.0% on the Cambridge Landmarks dataset, by 50.0% and 67.7% on the 7Scenes dataset, and by 73.7% and 41.5% on the 12Scenes dataset. We release our source code at https://github.com/AlcibiadesTophetScipio/SCINR.
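To make the conditioning mechanisms described in the abstract concrete, the following is a minimal PyTorch sketch of scene-conditioned modulation and a per-scene coordinate bias. It assumes a SIREN-style implicit-representation backbone; all module names, layer sizes, and the exact modulation form are illustrative assumptions, not the authors' implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn


class SceneModulatedSineLayer(nn.Module):
    """One sinusoidal layer whose amplitude, frequency, and phase are
    conditioned on a scene embedding. Hypothetical sketch: the names,
    dimensions, and modulation form are illustrative assumptions."""

    def __init__(self, in_dim: int, out_dim: int, embed_dim: int, omega_0: float = 30.0):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.omega_0 = omega_0
        # A small hypernetwork maps the scene embedding to per-layer
        # modulation parameters: amplitude (a), frequency scale (f), phase (p).
        self.to_modulation = nn.Linear(embed_dim, 3 * out_dim)

    def forward(self, x: torch.Tensor, scene_embed: torch.Tensor) -> torch.Tensor:
        a, f, p = self.to_modulation(scene_embed).chunk(3, dim=-1)
        # Modulated sinusoid: (1 + a) * sin((1 + f) * omega_0 * Wx + p).
        # The residual (1 + .) form keeps the unmodulated layer as default.
        return (1 + a) * torch.sin((1 + f) * self.omega_0 * self.linear(x) + p)


class MultiSceneSCRegressor(nn.Module):
    """Minimal multi-scene scene-coordinate regressor: learned scene
    embeddings condition shared layers, and a per-scene coordinate bias
    (the 'complement' idea) shifts predictions toward each scene's
    coordinate distribution in the global reference frame."""

    def __init__(self, num_scenes: int, feat_dim: int = 256, embed_dim: int = 64):
        super().__init__()
        self.scene_embeddings = nn.Embedding(num_scenes, embed_dim)
        self.layer1 = SceneModulatedSineLayer(feat_dim, 256, embed_dim)
        self.layer2 = SceneModulatedSineLayer(256, 256, embed_dim)
        self.head = nn.Linear(256, 3)                  # 3D scene coordinates
        self.scene_bias = nn.Embedding(num_scenes, 3)  # per-scene coordinate offset

    def forward(self, feats: torch.Tensor, scene_id: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, feat_dim) image features; scene_id: (B,) scene indices.
        e = self.scene_embeddings(scene_id).unsqueeze(1)  # (B, 1, embed_dim)
        h = self.layer2(self.layer1(feats, e), e)
        return self.head(h) + self.scene_bias(scene_id).unsqueeze(1)


# Usage: predict scene coordinates for two images from different scenes.
model = MultiSceneSCRegressor(num_scenes=7)
feats = torch.randn(2, 1024, 256)            # 1024 feature points per image
coords = model(feats, torch.tensor([0, 3]))  # -> (2, 1024, 3)
```

Once the per-point scene coordinates are predicted, the camera pose would typically be recovered with a standard PnP solver inside RANSAC, as is common for SC regression pipelines.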

Original language: English
Article number: 111791
Journal: Pattern Recognition
Volume: 168
State: Published - Dec 2025

Keywords

  • Conditional adaption
  • Implicit neural representation
  • Scene coordinate prediction
  • Visual relocalization
