TY - JOUR
T1 - Aggregating multi-scale flow-enhanced information in transformer for video inpainting
AU - Li, Guanxiao
AU - Zhang, Ke
AU - Su, Yu
AU - Wang, Jingyu
N1 - Publisher Copyright:
© The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2024.
PY - 2025/2
Y1 - 2025/2
AB - A critical challenge in video inpainting is effectively incorporating temporally coherent information into missing regions across frames, which requires both local and non-local temporal receptive fields. Current methods attempt to overcome this challenge by using either optical-flow guidance or a single-scale self-attention mechanism. However, flow-based methods rely heavily on accurate flow estimation and suffer from low efficiency, while attention-based methods lack dependency learning between multi-scale features, resulting in blurring in complex scenes. To address these issues, we propose a Multi-Scale Flow-enhanced Transformer Network (MSFT-Net) that combines flow guidance with a multi-scale Transformer. We restructure both flow-based and attention-based methods and integrate them into an end-to-end model with complementary strengths. Specifically, the MSFT-Net first completes optical flows using a Fast Fourier Convolution (FFC)-based network with large frequency-domain receptive fields. Subsequently, guided by the completed flows, our proposed second-order deformable feature aggregations effectively aggregate content across local frames. Finally, we design the Window-based Multi-Scale Temporal (WMS) Transformer to incorporate non-local features into missing regions by modeling multi-scale self-attention. Experimental results on the YouTube-VOS and DAVIS datasets demonstrate that the MSFT-Net outperforms previous methods while maintaining efficiency. By addressing the limitations of existing approaches and combining the advantages of flow-based and attention-based methods, our MSFT-Net provides a more robust and accurate solution for video inpainting tasks.
KW - Multi-scale transformer
KW - Optical-flow guidance
KW - Video inpainting
UR - http://www.scopus.com/inward/record.url?scp=85213712132&partnerID=8YFLogxK
U2 - 10.1007/s00530-024-01625-0
DO - 10.1007/s00530-024-01625-0
M3 - Article
AN - SCOPUS:85213712132
SN - 0942-4962
VL - 31
JO - Multimedia Systems
JF - Multimedia Systems
IS - 1
M1 - 32
ER -