TY - JOUR
T1 - Aggregating multi-scale flow-enhanced information in transformer for video inpainting
AU - Li, Guanxiao
AU - Zhang, Ke
AU - Su, Yu
AU - Wang, Jingyu
N1 - Publisher Copyright:
© The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2024.
PY - 2025/2
Y1 - 2025/2
AB - A critical challenge in video inpainting is effectively incorporating temporally coherent information into missing regions across frames, which requires both local and non-local temporal receptive fields. Current methods attempt to overcome this challenge by using either optical-flow guidance or a single-scale self-attention mechanism. However, flow-based methods rely heavily on accurate flow estimation and suffer from low efficiency, while attention-based methods lack dependency learning between multi-scale features, resulting in blurring in complex scenes. To address these issues, we propose a Multi-Scale Flow-enhanced Transformer Network (MSFT-Net) that combines flow guidance with a multi-scale Transformer. We restructure both flow-based and attention-based methods and integrate them into an end-to-end model with complementary strengths. Specifically, the MSFT-Net first completes optical flows using a Fast Fourier Convolution (FFC)-based network with large frequency-domain receptive fields. Subsequently, guided by the completed flows, our proposed second-order deformable feature aggregations effectively aggregate content across local frames. Finally, we design the Window-based Multi-Scale Temporal (WMS) Transformer to incorporate non-local features into missing regions by modeling multi-scale self-attention. Experimental results on the YouTube-VOS and DAVIS datasets demonstrate that the MSFT-Net outperforms previous methods while maintaining efficiency. By addressing the limitations of existing approaches and combining the advantages of flow-based and attention-based methods, our MSFT-Net provides a more robust and accurate solution for video inpainting tasks.
KW - Multi-scale transformer
KW - Optical-flow guidance
KW - Video inpainting
UR - http://www.scopus.com/inward/record.url?scp=85213712132&partnerID=8YFLogxK
U2 - 10.1007/s00530-024-01625-0
DO - 10.1007/s00530-024-01625-0
M3 - Article
AN - SCOPUS:85213712132
SN - 0942-4962
VL - 31
JO - Multimedia Systems
JF - Multimedia Systems
IS - 1
M1 - 32
ER -