WTVI: A Wavelet-Based Transformer Network for Video Inpainting

Ke Zhang, Guanxiao Li, Yu Su, Jingyu Wang

Research output: Contribution to journal, Article, peer-reviewed

5 Scopus citations

Abstract

Video inpainting aims to complete missing regions of video frames in a visually convincing way by balancing high-frequency detailed textures and low-frequency semantic structures. To strike this balance, conventional approaches optimize output frames with both adversarial and reconstruction losses, each of which favors a different frequency range. However, employing both loss types concurrently often results in a conflict between perceptual quality and distortion quality, mainly because of their distinct frequency preferences. In response, this letter introduces the Wavelet-based Transformer network for Video Inpainting (WTVI). WTVI employs a 2D discrete wavelet transform (DWT) to decompose frames into several frequency bands while preserving spatial information. It then completes the missing regions in each band independently using a Transformer network. To mitigate inter-frequency conflicts, we apply the reconstruction loss to the low-frequency bands and the adversarial loss to the high-frequency bands. Additionally, we introduce High-frequency Cross-Attention (HCA) and Low-frequency Cross-Attention (LCA) modules to enhance frequency dependency learning beyond the spatial-temporal scope and to align features across bands. Our experiments confirm that WTVI surpasses previous methods, significantly improving both quantitative and qualitative performance.
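The decomposition and the band-wise loss assignment described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single-level Haar wavelet (the abstract does not name the wavelet), uses plain NumPy, and the names `haar_dwt2`, `haar_idwt2`, `band_losses`, and the `critic` callable are hypothetical stand-ins introduced only for illustration.

```python
import numpy as np

def haar_dwt2(frame):
    """Single-level 2D Haar DWT of an (H, W) array with even H and W.

    Assumption: Haar is used only because it is the simplest wavelet.
    Returns the low-frequency band and a tuple of three detail bands,
    each of shape (H/2, W/2), so the spatial layout is preserved.
    """
    a = frame[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = frame[0::2, 1::2]  # top-right
    c = frame[1::2, 0::2]  # bottom-left
    d = frame[1::2, 1::2]  # bottom-right
    low = (a + b + c + d) / 2.0        # local average (low frequency)
    det_rows = (a + b - c - d) / 2.0   # difference between the two rows
    det_cols = (a - b + c - d) / 2.0   # difference between the two columns
    det_diag = (a - b - c + d) / 2.0   # diagonal difference
    return low, (det_rows, det_cols, det_diag)

def haar_idwt2(low, details):
    """Inverse of haar_dwt2: rebuilds the (H, W) frame from its four bands."""
    det_rows, det_cols, det_diag = details
    h, w = low.shape
    out = np.empty((2 * h, 2 * w), dtype=low.dtype)
    out[0::2, 0::2] = (low + det_rows + det_cols + det_diag) / 2.0
    out[0::2, 1::2] = (low + det_rows - det_cols - det_diag) / 2.0
    out[1::2, 0::2] = (low - det_rows + det_cols - det_diag) / 2.0
    out[1::2, 1::2] = (low - det_rows - det_cols + det_diag) / 2.0
    return out

def band_losses(pred, target, critic):
    """Reconstruction loss on the low band, adversarial term on the high bands.

    `critic` is a hypothetical callable mapping a detail band to a realness
    score array; the generator-side term below is one common form (negated
    mean score), not necessarily the one used in the paper.
    """
    pred_low, pred_details = haar_dwt2(pred)
    target_low, _ = haar_dwt2(target)
    rec = np.mean(np.abs(pred_low - target_low))  # L1 on low frequency only
    adv = -np.mean([np.mean(critic(band)) for band in pred_details])
    return rec, adv
```

Because the transform is invertible, a frame whose bands have been completed separately can be mapped back to pixel space with `haar_idwt2`; for example, `haar_idwt2(*haar_dwt2(frame))` returns the original `frame` up to floating-point error.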

Original language: English
Pages (from-to): 616-620
Number of pages: 5
Journal: IEEE Signal Processing Letters
Volume: 31
State: Published - 2024

Keywords

  • DWT
  • Video inpainting
  • vision transformer
