TY - GEN
T1 - Top-k Self-Attention in Transformer for Video Inpainting
AU - Li, Guanxiao
AU - Zhang, Ke
AU - Su, Yu
AU - Wang, Jing Yu
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
AB - Video inpainting restores missing content by exploiting global dependencies on relevant non-local regions across frames. Recent Transformer-based methods use self-attention to model connections between all patch embeddings. However, because relevant regions are scarce, existing methods allocate part of the attention weight to a large number of irrelevant areas, dispersing the dependencies and degrading modeling accuracy. To address this, we introduce a top-k self-attention mechanism designed for Transformer-based video inpainting that filters out the weights of less relevant regions. The mechanism computes a top-k weight threshold for each missing patch and forces the Transformer to attend only to the k most relevant patch embeddings. This sharpens dependency modeling and yields more effective content aggregation for filling the missing regions. The top-k mechanism integrates easily into any Transformer-based model, and experiments on the YouTube-VOS and DAVIS datasets show that it significantly improves performance while maintaining high efficiency.
KW - top-k self-attention mechanism
KW - Video inpainting
KW - vision Transformer
UR - http://www.scopus.com/inward/record.url?scp=85201145404&partnerID=8YFLogxK
U2 - 10.1109/ICCEA62105.2024.10603668
DO - 10.1109/ICCEA62105.2024.10603668
M3 - Conference contribution
AN - SCOPUS:85201145404
T3 - 2024 5th International Conference on Computer Engineering and Application, ICCEA 2024
SP - 1038
EP - 1042
BT - 2024 5th International Conference on Computer Engineering and Application, ICCEA 2024
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 5th International Conference on Computer Engineering and Application, ICCEA 2024
Y2 - 12 April 2024 through 14 April 2024
ER -
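
Note: the abstract above describes a top-k self-attention mechanism that keeps, per query, only the k largest attention weights and suppresses the rest before softmax. Below is a minimal PyTorch sketch of that general idea; the function name topk_self_attention, the tensor shapes, and the top_k parameter are illustrative assumptions, not the authors' implementation.

    import torch

    def topk_self_attention(q, k, v, top_k):
        # Sketch of top-k self-attention (assumed shapes: q, k, v of
        # size (batch, num_tokens, dim)); not the paper's exact code.
        d = q.size(-1)
        scores = q @ k.transpose(-2, -1) / d ** 0.5      # (B, N, N) logits
        # Per-query threshold: the value of the k-th largest logit.
        thresh = scores.topk(top_k, dim=-1).values[..., -1:]  # (B, N, 1)
        # Mask logits below the threshold so softmax assigns them zero weight,
        # concentrating attention on the k most relevant patch embeddings.
        scores = scores.masked_fill(scores < thresh, float('-inf'))
        attn = scores.softmax(dim=-1)
        return attn @ v

    # Usage example (hypothetical sizes): 128 patch tokens of dim 512,
    # each attending to its 16 most relevant tokens.
    # x = torch.randn(1, 128, 512)
    # out = topk_self_attention(x, x, x, top_k=16)

Because the thresholding is a drop-in change to the attention-score matrix, it can be inserted into any standard Transformer attention layer, which is consistent with the abstract's claim that the mechanism integrates easily into existing models.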