TY - GEN
T1 - Top-k Self-Attention in Transformer for Video Inpainting
AU - Li, Guanxiao
AU - Zhang, Ke
AU - Su, Yu
AU - Wang, Jing Yu
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
AB - Video inpainting restores missing content by exploiting global dependencies on relevant non-local regions across frames. Recent Transformer-based methods use self-attention to model connections between all patch embeddings. However, because relevant regions are scarce, existing methods allocate part of the attention weight to a large number of irrelevant areas, dispersing the dependencies and degrading modeling accuracy. To address this, we introduce a top-k self-attention mechanism designed for Transformer-based video inpainting that filters out the weights of less relevant regions. The mechanism computes a top-k weight threshold for each missing patch and forces the Transformer to attend only to the k most relevant patch embeddings. This sharpens dependency modeling and yields more effective content aggregation for filling the missing regions. The top-k mechanism integrates easily into any Transformer-based model, and experiments on the YouTube-VOS and DAVIS datasets show that it significantly improves performance while maintaining high efficiency.
KW - top-k self-attention mechanism
KW - Video inpainting
KW - vision Transformer
UR - http://www.scopus.com/inward/record.url?scp=85201145404&partnerID=8YFLogxK
U2 - 10.1109/ICCEA62105.2024.10603668
DO - 10.1109/ICCEA62105.2024.10603668
M3 - Conference contribution
AN - SCOPUS:85201145404
T3 - 2024 5th International Conference on Computer Engineering and Application, ICCEA 2024
SP - 1038
EP - 1042
BT - 2024 5th International Conference on Computer Engineering and Application, ICCEA 2024
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 5th International Conference on Computer Engineering and Application, ICCEA 2024
Y2 - 12 April 2024 through 14 April 2024
ER -
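
Note: the abstract above describes a top-k self-attention mechanism that keeps, per query, only the k largest attention weights and suppresses the rest before softmax. Below is a minimal PyTorch sketch of that general idea; the function name topk_self_attention, the tensor shapes, and the top_k parameter are illustrative assumptions, not the authors' implementation.

    import torch

    def topk_self_attention(q, k, v, top_k):
        # Sketch of top-k self-attention (assumed shapes: q, k, v of
        # size (batch, num_tokens, dim)); not the paper's exact code.
        d = q.size(-1)
        scores = q @ k.transpose(-2, -1) / d ** 0.5      # (B, N, N) logits
        # Per-query threshold: the value of the k-th largest logit.
        thresh = scores.topk(top_k, dim=-1).values[..., -1:]  # (B, N, 1)
        # Mask logits below the threshold so softmax assigns them zero weight,
        # concentrating attention on the k most relevant patch embeddings.
        scores = scores.masked_fill(scores < thresh, float('-inf'))
        attn = scores.softmax(dim=-1)
        return attn @ v

    # Usage example (hypothetical sizes): 128 patch tokens of dim 512,
    # each attending to its 16 most relevant tokens.
    # x = torch.randn(1, 128, 512)
    # out = topk_self_attention(x, x, x, top_k=16)

Because the thresholding is a drop-in change to the attention-score matrix, it can be inserted into any standard Transformer attention layer, which is consistent with the abstract's claim that the mechanism integrates easily into existing models.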