TY - JOUR
T1 - Feature pre-inpainting enhanced transformer for video inpainting
AU - Li, Guanxiao
AU - Zhang, Ke
AU - Su, Yu
AU - Wang, Jingyu
N1 - Publisher Copyright:
© 2023
PY - 2023/8
Y1 - 2023/8
AB - Transformer-based video inpainting methods aggregate coherent content into missing regions by learning spatial–temporal dependencies. However, existing methods suffer from inaccurate self-attention calculation and excessive quadratic computational complexity, caused by uninformative representations of missing regions and inefficient global self-attention mechanisms, respectively. To mitigate these problems, we propose a Feature pre-Inpainting enhanced Transformer (FITer) video inpainting method, in which a feature pre-inpainting network (FPNet) and a local–global interleaving Transformer are designed. The FPNet pre-inpaints missing features before the Transformer by exploiting spatial context, so the representations of missing regions are enhanced with more informative content. The interleaving Transformer can therefore calculate more accurate self-attention weights and learn more effective dependencies between missing and valid regions. Since the interleaving Transformer combines global and window-based local self-attention mechanisms, the proposed FITer method can effectively aggregate spatial–temporal features into missing regions while improving efficiency. Experiments on the YouTube-VOS and DAVIS datasets demonstrate that FITer outperforms previous methods both qualitatively and quantitatively.
KW - Feature pre-inpainting
KW - Local–global interleaving transformer
KW - Video inpainting
UR - http://www.scopus.com/inward/record.url?scp=85153309463&partnerID=8YFLogxK
U2 - 10.1016/j.engappai.2023.106323
DO - 10.1016/j.engappai.2023.106323
M3 - Article
AN - SCOPUS:85153309463
SN - 0952-1976
VL - 123
JO - Engineering Applications of Artificial Intelligence
JF - Engineering Applications of Artificial Intelligence
M1 - 106323
ER -