Abstract
Video inpainting aims to fill missing regions of frames in a visually convincing way by balancing high-frequency detailed textures against low-frequency semantic structures. Conventional approaches pursue this balance by optimizing output frames with both a generative adversarial loss and a reconstruction loss, each of which favors different frequency components. However, employing both loss types concurrently often creates a conflict between perceptual quality and distortion quality, mainly because of their distinct frequency preferences. In response, this letter introduces the Wavelet-based Transformer network for Video Inpainting (WTVI). WTVI employs a 2D discrete wavelet transform (DWT) to decompose frames into multiple frequency bands while preserving spatial information. It then completes the missing regions in each band independently using a Transformer network. To mitigate inter-frequency conflicts, we apply the reconstruction loss to the low-frequency bands and the adversarial loss to the high-frequency bands. Additionally, we introduce High-frequency Cross-Attention (HCA) and Low-frequency Cross-Attention (LCA) modules to learn frequency dependencies beyond the spatial-temporal scope and to align features across bands. Our experiments confirm that WTVI surpasses previous methods, significantly improving both quantitative and qualitative performance.
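The band decomposition described above can be illustrated with a minimal sketch. The abstract does not specify which wavelet WTVI uses, so the example below assumes a single-level 2D Haar DWT (the simplest orthogonal wavelet) implemented with plain NumPy: a frame is split into one low-frequency band (LL) and three high-frequency bands (LH, HL, HH), and the inverse transform reconstructs the frame exactly, which is what "preserving spatial information" means here.

```python
import numpy as np

def haar_dwt2(frame):
    """Single-level 2D Haar DWT: split a frame (H x W, even dims)
    into one low-frequency band (LL) and three high-frequency
    bands (LH, HL, HH), each of size H/2 x W/2."""
    # Horizontal pass: low-pass (average) and high-pass (difference)
    # over pairs of adjacent columns.
    lo = (frame[:, 0::2] + frame[:, 1::2]) / 2.0
    hi = (frame[:, 0::2] - frame[:, 1::2]) / 2.0
    # Vertical pass over pairs of adjacent rows.
    ll = (lo[0::2, :] + lo[1::2, :]) / 2.0  # coarse structure
    lh = (lo[0::2, :] - lo[1::2, :]) / 2.0  # horizontal detail
    hl = (hi[0::2, :] + hi[1::2, :]) / 2.0  # vertical detail
    hh = (hi[0::2, :] - hi[1::2, :]) / 2.0  # diagonal detail
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse transform: recovers the original frame exactly,
    so the decomposition loses no spatial information."""
    h, w = ll.shape
    lo = np.empty((2 * h, w)); hi = np.empty((2 * h, w))
    lo[0::2, :], lo[1::2, :] = ll + lh, ll - lh
    hi[0::2, :], hi[1::2, :] = hl + hh, hl - hh
    frame = np.empty((2 * h, 2 * w))
    frame[:, 0::2], frame[:, 1::2] = lo + hi, lo - hi
    return frame
```

In a setup like WTVI's, each band would then be inpainted by its own Transformer branch, with the reconstruction loss supervising LL and the adversarial loss supervising LH, HL, and HH; the inverse transform merges the completed bands back into an output frame.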
Original language | English |
---|---|
Pages (from-to) | 616-620 |
Number of pages | 5 |
Journal | IEEE Signal Processing Letters |
Volume | 31 |
DOIs | |
State | Published - 2024 |
Keywords
- DWT
- Video inpainting
- vision transformer