Feature pre-inpainting enhanced transformer for video inpainting

Guanxiao Li, Ke Zhang, Yu Su, Jingyu Wang

Research output: Contribution to journal › Article › peer-review


Abstract

Transformer-based video inpainting methods aggregate coherent content into missing regions by learning spatial–temporal dependencies. However, existing methods suffer from inaccurate self-attention calculation and excessive quadratic computational complexity, caused by uninformative representations of missing regions and inefficient global self-attention mechanisms, respectively. To mitigate these problems, we propose a Feature pre-Inpainting enhanced Transformer (FITer) video inpainting method, in which a feature pre-inpainting network (FPNet) and a local–global interleaving Transformer are designed. The FPNet pre-inpaints missing features before the Transformer by exploiting spatial context, so the representations of missing regions are enhanced with more informative content. The interleaving Transformer can therefore calculate more accurate self-attention weights and learn more effective dependencies between missing and valid regions. Since the interleaving Transformer combines global and window-based local self-attention mechanisms, the proposed FITer method can effectively aggregate spatial–temporal features into missing regions while improving efficiency. Experiments on the YouTube-VOS and DAVIS datasets demonstrate that FITer outperforms previous methods both qualitatively and quantitatively.
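The abstract does not include code, but the local–global interleaving idea can be illustrated with a minimal PyTorch sketch. Everything below is a hypothetical reconstruction under stated assumptions, not the authors' implementation: the class names (LocalWindowAttention, InterleavedBlock, InterleavedTransformer) and parameters (window_size, depth, head count) are invented for illustration, and the FPNet pre-inpainting stage is omitted.

```python
# Minimal sketch of local-global interleaved self-attention for
# spatial-temporal video tokens. Hypothetical names and hyperparameters;
# not the paper's released code.

import torch
import torch.nn as nn


class LocalWindowAttention(nn.Module):
    """Self-attention restricted to non-overlapping ws x ws spatial windows.
    Assumes H and W are divisible by the window size."""

    def __init__(self, dim, heads, window_size):
        super().__init__()
        self.window_size = window_size
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, h, w):
        # x: (B, T*H*W, C) flattened spatial-temporal tokens
        b, n, c = x.shape
        t = n // (h * w)
        ws = self.window_size
        # Partition each frame into windows and attend within each window.
        x = x.view(b, t, h // ws, ws, w // ws, ws, c)
        x = x.permute(0, 1, 2, 4, 3, 5, 6).reshape(-1, ws * ws, c)
        x, _ = self.attn(x, x, x)
        # Undo the window partition back to (B, T*H*W, C).
        x = x.view(b, t, h // ws, w // ws, ws, ws, c)
        x = x.permute(0, 1, 2, 4, 3, 5, 6).reshape(b, n, c)
        return x


class InterleavedBlock(nn.Module):
    """One pre-norm transformer block using either local or global attention."""

    def __init__(self, dim, heads, window_size, mode):
        super().__init__()
        self.mode = mode
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        if mode == "local":
            self.attn = LocalWindowAttention(dim, heads, window_size)
        else:  # global attention over all spatial-temporal tokens
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x, h, w):
        y = self.norm1(x)
        if self.mode == "local":
            x = x + self.attn(y, h, w)
        else:
            x = x + self.attn(y, y, y)[0]
        return x + self.mlp(self.norm2(x))


class InterleavedTransformer(nn.Module):
    """Alternates window-based local and global blocks: local blocks keep the
    attention cost sub-quadratic in token count, while the periodic global
    blocks still propagate features across all frames and positions."""

    def __init__(self, dim=256, heads=4, window_size=8, depth=4):
        super().__init__()
        modes = ["local" if i % 2 == 0 else "global" for i in range(depth)]
        self.blocks = nn.ModuleList(
            InterleavedBlock(dim, heads, window_size, m) for m in modes
        )

    def forward(self, x, h, w):
        for blk in self.blocks:
            x = blk(x, h, w)
        return x


if __name__ == "__main__":
    b, t, h, w, c = 1, 2, 32, 32, 256
    tokens = torch.randn(b, t * h * w, c)
    out = InterleavedTransformer(dim=c)(tokens, h, w)
    print(out.shape)  # torch.Size([1, 2048, 256])
```

The design trade-off this sketch illustrates: attention within non-overlapping ws×ws windows costs O(HW·ws²) per frame instead of the O((HW)²) of full global attention, which is the efficiency argument the abstract makes; interleaving in occasional global blocks restores the long-range, cross-frame dependencies that window attention alone cannot capture.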

Original language: English
Article number: 106323
Journal: Engineering Applications of Artificial Intelligence
Volume: 123
DOIs
State: Published - Aug 2023

Keywords

  • Feature pre-inpainting
  • Local–global interleaving transformer
  • Video inpainting

