Aggregating multi-scale flow-enhanced information in transformer for video inpainting

Guanxiao Li, Ke Zhang, Yu Su, Jingyu Wang

Research output: Contribution to journal › Article › peer-review

Abstract

A critical challenge in video inpainting is effectively incorporating temporally coherent information into missing regions across frames, which requires both local and non-local temporal receptive fields. Current methods attempt to overcome this challenge by using either optical-flow guidance or a single-scale self-attention mechanism. However, flow-based methods rely heavily on accurate flow estimation and suffer from low efficiency, while attention-based methods lack dependency learning between multi-scale features, resulting in blurring in complex scenes. To address these issues, we propose a Multi-Scale Flow-enhanced Transformer Network (MSFT-Net) that combines flow guidance with a multi-scale Transformer. We restructure both flow-based and attention-based designs and integrate them into an end-to-end model with complementary strengths. Specifically, the MSFT-Net first completes optical flows using a Fast Fourier Convolution (FFC) based network with large frequency-domain receptive fields. Subsequently, guided by the completed flows, our proposed second-order deformable feature aggregation propagates content across local frames. Finally, we design the Window-based Multi-Scale temporal (WMS) Transformer to incorporate non-local features into missing regions by modeling multi-scale self-attention. Experimental results on the YouTube-VOS and DAVIS datasets demonstrate that the MSFT-Net outperforms previous methods while maintaining efficiency. By addressing the limitations of existing approaches and combining the advantages of flow-based and attention-based methods, our MSFT-Net provides a more robust and accurate solution for video inpainting.
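To make the core idea of the WMS Transformer concrete, below is a minimal, hypothetical PyTorch sketch of window-based multi-scale temporal self-attention. It is not the authors' released implementation: the class name, window sizes, and all hyperparameters are illustrative assumptions. The sketch splits the channels into groups, lets each group attend within spatio-temporal windows of a different spatial size (each window spanning all frames), and fuses the groups, so fine windows preserve detail while coarse windows gather non-local context.

```python
# Hypothetical illustration only (not the paper's released code): a minimal
# PyTorch sketch of window-based multi-scale temporal self-attention.
# Class name, window sizes, and hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class WindowMultiScaleTemporalAttention(nn.Module):
    """Splits channels into groups, attends within spatio-temporal windows of a
    different spatial size per group (each window spans all frames), then fuses
    the groups back into a single feature map."""

    def __init__(self, dim, num_heads=4, window_sizes=(4, 8)):
        super().__init__()
        # assumes dim is divisible by len(window_sizes), and each group's width
        # (dim // len(window_sizes)) is divisible by num_heads
        self.window_sizes = window_sizes
        self.scale_dim = dim // len(window_sizes)
        self.attn = nn.ModuleList(
            [nn.MultiheadAttention(self.scale_dim, num_heads, batch_first=True)
             for _ in window_sizes]
        )
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (B, T, H, W, C) spatio-temporal feature tokens
        B, T, H, W, C = x.shape
        outs = []
        for feat, ws, attn in zip(x.split(self.scale_dim, dim=-1),
                                  self.window_sizes, self.attn):
            # pad H and W so they divide evenly into ws x ws windows
            pad_h, pad_w = (ws - H % ws) % ws, (ws - W % ws) % ws
            f = F.pad(feat, (0, 0, 0, pad_w, 0, pad_h))
            Hp, Wp = H + pad_h, W + pad_w
            # partition into windows that span every frame (temporal attention)
            f = f.view(B, T, Hp // ws, ws, Wp // ws, ws, self.scale_dim)
            f = f.permute(0, 2, 4, 1, 3, 5, 6).reshape(-1, T * ws * ws, self.scale_dim)
            out, _ = attn(f, f, f)  # self-attention inside each window
            # reverse the window partition and drop the padding
            out = out.view(B, Hp // ws, Wp // ws, T, ws, ws, self.scale_dim)
            out = out.permute(0, 3, 1, 4, 2, 5, 6).reshape(B, T, Hp, Wp, self.scale_dim)
            outs.append(out[:, :, :H, :W, :])
        return self.proj(torch.cat(outs, dim=-1))


# quick shape check on dummy data
if __name__ == "__main__":
    tokens = torch.randn(1, 5, 32, 32, 128)  # 5 frames of 32x32 tokens, 128 channels
    wms = WindowMultiScaleTemporalAttention(dim=128)
    print(wms(tokens).shape)  # torch.Size([1, 5, 32, 32, 128])
```

In this sketch, splitting the channels across scales keeps the cost close to that of single-scale window attention while still letting coarse and fine windows complement each other; the paper's actual WMS Transformer may differ in how scales are defined and fused.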

Original language: English
Article number: 32
Journal: Multimedia Systems
Volume: 31
Issue number: 1
DOIs
State: Published - Feb 2025

Keywords

  • Multi-scale transformer
  • Optical-flow guidance
  • Video inpainting
