TY - JOUR
T1 - Feature pre-inpainting enhanced transformer for video inpainting
AU - Li, Guanxiao
AU - Zhang, Ke
AU - Su, Yu
AU - Wang, Jingyu
N1 - Publisher Copyright:
© 2023
PY - 2023/8
Y1 - 2023/8
AB - Transformer-based video inpainting methods aggregate coherent content into missing regions by learning spatial–temporal dependencies. However, existing methods suffer from inaccurate self-attention calculation and excessive quadratic computational complexity, caused by uninformative representations of missing regions and inefficient global self-attention mechanisms, respectively. To mitigate these problems, we propose a Feature pre-Inpainting enhanced Transformer (FITer) video inpainting method, in which a feature pre-inpainting network (FPNet) and a local–global interleaving Transformer are designed. The FPNet pre-inpaints missing features before the Transformer by exploiting spatial context, so the representations of missing regions are enhanced with more informative content. The interleaving Transformer can therefore calculate more accurate self-attention weights and learn more effective dependencies between missing and valid regions. Since the interleaving Transformer combines global and window-based local self-attention mechanisms, the proposed FITer method can effectively aggregate spatial–temporal features into missing regions while improving efficiency. Experiments on the YouTube-VOS and DAVIS datasets demonstrate that FITer outperforms previous methods both qualitatively and quantitatively.
KW - Feature pre-inpainting
KW - Local–global interleaving transformer
KW - Video inpainting
UR - http://www.scopus.com/inward/record.url?scp=85153309463&partnerID=8YFLogxK
U2 - 10.1016/j.engappai.2023.106323
DO - 10.1016/j.engappai.2023.106323
M3 - Article
AN - SCOPUS:85153309463
SN - 0952-1976
VL - 123
JO - Engineering Applications of Artificial Intelligence
JF - Engineering Applications of Artificial Intelligence
M1 - 106323
ER -