TY - JOUR
T1 - Attention-guided video super-resolution with recurrent multi-scale spatial–temporal transformer
AU - Sun, Wei
AU - Kong, Xianguang
AU - Zhang, Yanning
N1 - Publisher Copyright:
© 2022, The Author(s).
PY - 2023/8
Y1 - 2023/8
AB - Video super-resolution (VSR) aims to recover high-resolution (HR) content from low-resolution (LR) observations by compositing the spatial–temporal information in the LR frames, so propagating and aggregating spatial–temporal information is crucial. Recently, while transformers have shown impressive performance on high-level vision tasks, few attempts have been made to apply them to image restoration, especially to VSR. Moreover, previous transformers process spatial and temporal information simultaneously, which easily synthesizes confused textures, and their high computational cost limits their adoption. To this end, we construct a novel bidirectional recurrent VSR architecture. Our model disentangles the task of learning spatial–temporal information into two easier sub-tasks; each sub-task focuses on propagating and aggregating specific information with a multi-scale transformer-based design, which alleviates the difficulty of learning. Additionally, an attention-guided motion compensation module is applied to remove the influence of misalignment between frames. Experiments on three widely used benchmark datasets show that, owing to superior feature correlation learning, the proposed network outperforms previous state-of-the-art methods, especially at recovering fine details.
KW - Attention mechanism
KW - Motion compensation
KW - Spatial–temporal transformer
KW - Video super-resolution
UR - http://www.scopus.com/inward/record.url?scp=85144025937&partnerID=8YFLogxK
DO - 10.1007/s40747-022-00944-x
M3 - Article
AN - SCOPUS:85144025937
SN - 2199-4536
VL - 9
SP - 3989
EP - 4002
JO - Complex & Intelligent Systems
JF - Complex & Intelligent Systems
IS - 4
ER -