TY - JOUR
T1 - Implicit neural representation model for camera relocalization in multiple scenes
AU - Yao, Shun
AU - Cheng, Yongmei
AU - Yang, Fei
AU - Mozerov, Mikhail G.
N1 - Publisher Copyright:
© 2025
PY - 2025/12
Y1 - 2025/12
N2 - One critical task in remote sensing is determining the position of a video camera relative to the scene depicted in the series of images it captures. Classical approaches often require pre-built scene representations and complex, time-consuming algorithms. Recent methods built on scene coordinate (SC) regression models have demonstrated promising accuracy and efficiency in single-scene visual relocalization. However, extending SC regression models to multiple scenes typically requires retraining model parameters or constructing reference landmarks, which is time-consuming. This paper proposes representing multiple scenes within a global reference coordinate system so that a single SC regression model can be trained efficiently in one training procedure. We encode scene information in scene embeddings that serve as a prior condition for the model's predictions. We design a scene-conditional regression-adjust (SCRA) module that adapts the model to the scene embedding by dynamically generating parameters during inference. Additionally, we employ modulation and complement modules to improve the model's predictions at both the image-sample and scene levels. The modulation module adjusts the amplitude, phase, and frequency of the data flow for each input image, while the complement module derives scene-specific coordinate biases to reduce distribution differences between scenes. Extensive experiments on indoor and outdoor datasets validate the model's efficiency and accuracy in multi-scene visual relocalization. Compared with the state-of-the-art MS-Transformer model, our model requires less training time and achieves more accurate relocalization, reducing the average median position and rotation errors by 50.0% and 52.0% on the Cambridge Landmarks dataset and by 61.1% and 73.9% on the 7Scenes dataset. Compared with the separately trained FeatLoc++Au model, our model achieves relative improvements in average median position and rotation errors of 64.6% and 81.0% on the Cambridge Landmarks dataset, 50.0% and 67.7% on the 7Scenes dataset, and 73.7% and 41.5% on the 12Scenes dataset. We release our source code at https://github.com/AlcibiadesTophetScipio/SCINR.
KW - Conditional adaption
KW - Implicit neural representation
KW - Scene coordinate prediction
KW - Visual relocalization
UR - http://www.scopus.com/inward/record.url?scp=105005273150&partnerID=8YFLogxK
U2 - 10.1016/j.patcog.2025.111791
DO - 10.1016/j.patcog.2025.111791
M3 - Article
AN - SCOPUS:105005273150
SN - 0031-3203
VL - 168
JO - Pattern Recognition
JF - Pattern Recognition
M1 - 111791
ER -