TY - JOUR
T1 - Implicit neural representation model for camera relocalization in multiple scenes
AU - Yao, Shun
AU - Cheng, Yongmei
AU - Yang, Fei
AU - Mozerov, Mikhail G.
N1 - Publisher Copyright:
© 2025
PY - 2025/12
Y1 - 2025/12
N2 - One critical task in remote sensing is determining the position of a video camera relative to the scene depicted in the series of images it captures. Classical approaches often require pre-built scene representations and complex, time-consuming algorithms. Recent methods built on scene coordinate (SC) regression models have demonstrated promising accuracy and efficiency in single-scene visual relocalization. However, extending SC regression models to multiple scenes typically requires retraining model parameters or constructing reference landmarks, which is time-consuming. This paper proposes representing multiple scenes within a global reference coordinate system so that a single SC regression model can be trained efficiently in one training procedure. We encode scene information in scene embeddings that serve as a prior condition for the model's predictions. We design a scene-conditional regression-adjust (SCRA) module that adapts the model to the scene embedding by dynamically generating parameters during inference. Additionally, we employ modulation and complement modules to improve the model's predictions at both the image-sample and scene levels. The modulation module adjusts the amplitude, phase, and frequency of the data flow for each input image, while the complement module derives scene-specific coordinate biases to reduce distribution differences between scenes. Extensive experiments on indoor and outdoor datasets validate the model's efficiency and accuracy in multi-scene visual relocalization. Compared with the state-of-the-art MS-Transformer model, our model requires less training time and achieves more accurate relocalization, reducing the average median position and rotation errors by 50.0% and 52.0% on the Cambridge Landmarks dataset and by 61.1% and 73.9% on the 7Scenes dataset. Compared with the separately trained FeatLoc++Au model, our model achieves relative improvements in average median position and rotation errors of 64.6% and 81.0% on the Cambridge Landmarks dataset, 50.0% and 67.7% on the 7Scenes dataset, and 73.7% and 41.5% on the 12Scenes dataset. We release our source code at https://github.com/AlcibiadesTophetScipio/SCINR.
KW - Conditional adaption
KW - Implicit neural representation
KW - Scene coordinate prediction
KW - Visual relocalization
UR - http://www.scopus.com/inward/record.url?scp=105005273150&partnerID=8YFLogxK
U2 - 10.1016/j.patcog.2025.111791
DO - 10.1016/j.patcog.2025.111791
M3 - Article
AN - SCOPUS:105005273150
SN - 0031-3203
VL - 168
JO - Pattern Recognition
JF - Pattern Recognition
M1 - 111791
ER -