Implicit neural representation model for camera relocalization in multiple scenes

Shun Yao, Yongmei Cheng, Fei Yang, Mikhail G. Mozerov

Research output: Contribution to journal › Article › peer-review

Abstract

One critical task in remote sensing is determining the position of a video camera relative to the scene depicted in a series of images captured by the camera. Classical approaches often necessitate pre-built scene representations and the implementation of complex, time-consuming algorithms. Recent methods utilizing scene coordinate (SC) regression-based models have demonstrated promising performance in visual relocalization, in terms of both accuracy and efficiency, for a single scene. However, extending SC regression models to multiple scenes typically requires retraining model parameters or constructing reference landmarks, which is a time-consuming process. This paper proposes representing multiple scenes within a global reference coordinate system to efficiently train a single SC regression model in one training procedure. We encode scene information in scene embeddings as a prior condition for our model predictions. We design a scene-conditional regression-adjust (SCRA) module to adapt the model to the scene embedding by dynamically generating parameters during inference. Additionally, we employ modulation and complement modules to enhance the model's prediction applicability at both the image sample and scene levels. The modulation module adjusts the amplitude, phase, and frequency of the data flow for each input image, while the complement module derives scene-specific coordinate biases to reduce distribution differences between scenes. Extensive experiments on indoor and outdoor datasets validate our model's efficiency and accuracy in multi-scene visual relocalization. Compared to the state-of-the-art MS-Transformer model, our model requires less training time and achieves more accurate relocalization results, with reductions in average median errors of position and rotation by 50.0% and 52.0% on the Cambridge Landmarks dataset, and by 61.1% and 73.9% on the 7Scenes dataset. Compared to the separately trained advanced FeatLoc++Au model, our model achieves relative improvements in average median errors of position and rotation by 64.6% and 81.0% on the Cambridge Landmarks dataset, by 50.0% and 67.7% on the 7Scenes dataset, and by 73.7% and 41.5% on the 12Scenes dataset. We release our source code at https://github.com/AlcibiadesTophetScipio/SCINR.
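The abstract describes three scene-conditioned ingredients: a scene embedding, an SCRA-style module that dynamically generates modulation parameters (amplitude, frequency, phase), and a complement module that adds a scene-specific coordinate bias. The sketch below is a minimal, illustrative PyTorch rendering of that idea, not the authors' released implementation (see the SCINR repository linked above); the class and parameter names are assumptions introduced here for illustration only.

```python
# Illustrative sketch (not the authors' code) of scene-conditioned SC regression:
# a scene embedding drives (i) dynamically generated modulation of amplitude,
# frequency, and phase of sinusoidal features, and (ii) a scene-specific
# coordinate bias added to the regressed scene coordinates.
import torch
import torch.nn as nn


class SceneConditionedSCHead(nn.Module):
    """Toy scene-coordinate regression head conditioned on a scene embedding."""

    def __init__(self, num_scenes: int, feat_dim: int = 256, embed_dim: int = 64):
        super().__init__()
        self.scene_embed = nn.Embedding(num_scenes, embed_dim)

        # SCRA-like hypernetwork: generate per-scene modulation parameters
        # (amplitude, frequency, phase) from the scene embedding at inference time.
        self.to_modulation = nn.Linear(embed_dim, 3 * feat_dim)

        # Shared projection of per-pixel image features.
        self.proj = nn.Linear(feat_dim, feat_dim)

        # Regressor from modulated features to 3D scene coordinates.
        self.to_coord = nn.Linear(feat_dim, 3)

        # Complement-like scene-specific coordinate bias, absorbing
        # distribution differences between scenes in the global frame.
        self.to_bias = nn.Linear(embed_dim, 3)

    def forward(self, feats: torch.Tensor, scene_id: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, feat_dim) per-pixel features; scene_id: (B,) scene indices
        emb = self.scene_embed(scene_id)                        # (B, embed_dim)
        amp, freq, phase = self.to_modulation(emb).chunk(3, dim=-1)

        h = self.proj(feats)                                    # (B, N, feat_dim)
        # Sinusoidal modulation: the scene controls amplitude, frequency, phase.
        h = amp.unsqueeze(1) * torch.sin(freq.unsqueeze(1) * h + phase.unsqueeze(1))

        coords = self.to_coord(h)                               # (B, N, 3)
        return coords + self.to_bias(emb).unsqueeze(1)          # add scene bias


if __name__ == "__main__":
    head = SceneConditionedSCHead(num_scenes=7)
    feats = torch.randn(2, 1024, 256)      # dummy per-pixel features
    scene_id = torch.tensor([0, 3])        # two images from different scenes
    print(head(feats, scene_id).shape)     # torch.Size([2, 1024, 3])
```

In this reading, a single regression head serves all scenes: only the embedding (and the parameters generated from it) changes per scene, which is what allows one training procedure to cover multiple scenes in a shared global coordinate system.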

Original language: English
Article number: 111791
Journal: Pattern Recognition
Volume: 168
DOI
Publication status: Published - Dec 2025
