Abstract
Video inpainting aims to fill missing regions of frames in a visually convincing way by balancing high-frequency detailed textures against low-frequency semantic structures. Conventional approaches pursue this balance by optimizing output frames with both a generative adversarial loss and a reconstruction loss, each of which favors different frequency components. However, employing both loss types concurrently often creates a conflict between perceptual quality and distortion quality, mainly because of their distinct frequency preferences. In response, this letter introduces the Wavelet-based Transformer network for Video Inpainting (WTVI). WTVI employs a 2D discrete wavelet transform (DWT) to decompose frames into multiple frequency bands while preserving spatial information. It then completes the missing regions in each band independently using a Transformer network. To mitigate inter-frequency conflicts, we apply the reconstruction loss to the low-frequency bands and the adversarial loss to the high-frequency bands. Additionally, we introduce High-frequency Cross-Attention (HCA) and Low-frequency Cross-Attention (LCA) modules to learn frequency dependencies beyond the spatial-temporal scope and to align features across bands. Our experiments confirm that WTVI surpasses previous methods, significantly improving both quantitative and qualitative performance.
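The band decomposition described above can be illustrated with a minimal sketch. The abstract does not specify which wavelet WTVI uses, so the example below assumes a single-level 2D Haar DWT (the simplest orthogonal wavelet) implemented with plain NumPy: a frame is split into one low-frequency band (LL) and three high-frequency bands (LH, HL, HH), and the inverse transform reconstructs the frame exactly, which is what "preserving spatial information" means here.

```python
import numpy as np

def haar_dwt2(frame):
    """Single-level 2D Haar DWT: split a frame (H x W, even dims)
    into one low-frequency band (LL) and three high-frequency
    bands (LH, HL, HH), each of size H/2 x W/2."""
    # Horizontal pass: low-pass (average) and high-pass (difference)
    # over pairs of adjacent columns.
    lo = (frame[:, 0::2] + frame[:, 1::2]) / 2.0
    hi = (frame[:, 0::2] - frame[:, 1::2]) / 2.0
    # Vertical pass over pairs of adjacent rows.
    ll = (lo[0::2, :] + lo[1::2, :]) / 2.0  # coarse structure
    lh = (lo[0::2, :] - lo[1::2, :]) / 2.0  # horizontal detail
    hl = (hi[0::2, :] + hi[1::2, :]) / 2.0  # vertical detail
    hh = (hi[0::2, :] - hi[1::2, :]) / 2.0  # diagonal detail
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse transform: recovers the original frame exactly,
    so the decomposition loses no spatial information."""
    h, w = ll.shape
    lo = np.empty((2 * h, w)); hi = np.empty((2 * h, w))
    lo[0::2, :], lo[1::2, :] = ll + lh, ll - lh
    hi[0::2, :], hi[1::2, :] = hl + hh, hl - hh
    frame = np.empty((2 * h, 2 * w))
    frame[:, 0::2], frame[:, 1::2] = lo + hi, lo - hi
    return frame
```

In a setup like WTVI's, each band would then be inpainted by its own Transformer branch, with the reconstruction loss supervising LL and the adversarial loss supervising LH, HL, and HH; the inverse transform merges the completed bands back into an output frame.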
Original language | English |
---|---|
Pages (from-to) | 616-620 |
Number of pages | 5 |
Journal | IEEE Signal Processing Letters |
Volume | 31 |
DOIs | |
State | Published - 2024 |
Keywords
- DWT
- Video inpainting
- vision transformer