TY - JOUR
T1 - Video super-resolution via mixed spatial-temporal convolution and selective fusion
AU - Sun, Wei
AU - Gong, Dong
AU - Shi, Javen Qinfeng
AU - van den Hengel, Anton
AU - Zhang, Yanning
N1 - Publisher Copyright:
© 2022 Elsevier Ltd
PY - 2022/6
Y1 - 2022/6
N2 - Video super-resolution aims to recover high-resolution (HR) content from low-resolution (LR) observations by compositing the spatial-temporal information in the LR frames. It is crucial to model the spatial-temporal information jointly, since video sequences are three-dimensional spatial-temporal signals. Compared with explicitly estimating motion between 2D frames, 3D convolutional neural networks (CNNs) have shown their efficiency and effectiveness for video super-resolution (SR), as a natural way of modelling spatial-temporal data. Though promising, the performance of 3D CNNs is still far from satisfactory. Their high computational and memory requirements limit the development of more advanced designs that extract and fuse information at a larger spatial and temporal scale. We thus propose a Mixed Spatial-Temporal Convolution (MSTC) block that simultaneously extracts the spatial information and the supplementary temporal dependency among frames by jointly applying 2D and 3D convolution. To further fuse the learned features corresponding to different frames, we propose a novel similarity-based selective feature fusion strategy, unlike previous methods that directly stack the learned features. Additionally, an attention-based motion compensation module is applied to alleviate the influence of misalignment between frames. Experiments on three widely used benchmark datasets and a real-world dataset show that, relying on its superior feature extraction and fusion ability, the proposed network outperforms previous state-of-the-art methods, especially in recovering confusing details.
AB - Video super-resolution aims to recover high-resolution (HR) content from low-resolution (LR) observations by compositing the spatial-temporal information in the LR frames. It is crucial to model the spatial-temporal information jointly, since video sequences are three-dimensional spatial-temporal signals. Compared with explicitly estimating motion between 2D frames, 3D convolutional neural networks (CNNs) have shown their efficiency and effectiveness for video super-resolution (SR), as a natural way of modelling spatial-temporal data. Though promising, the performance of 3D CNNs is still far from satisfactory. Their high computational and memory requirements limit the development of more advanced designs that extract and fuse information at a larger spatial and temporal scale. We thus propose a Mixed Spatial-Temporal Convolution (MSTC) block that simultaneously extracts the spatial information and the supplementary temporal dependency among frames by jointly applying 2D and 3D convolution. To further fuse the learned features corresponding to different frames, we propose a novel similarity-based selective feature fusion strategy, unlike previous methods that directly stack the learned features. Additionally, an attention-based motion compensation module is applied to alleviate the influence of misalignment between frames. Experiments on three widely used benchmark datasets and a real-world dataset show that, relying on its superior feature extraction and fusion ability, the proposed network outperforms previous state-of-the-art methods, especially in recovering confusing details.
KW - Mixed spatial-temporal convolution
KW - Selective feature fusion
KW - Video super-resolution
UR - http://www.scopus.com/inward/record.url?scp=85124703370&partnerID=8YFLogxK
U2 - 10.1016/j.patcog.2022.108577
DO - 10.1016/j.patcog.2022.108577
M3 - Article
AN - SCOPUS:85124703370
SN - 0031-3203
VL - 126
JO - Pattern Recognition
JF - Pattern Recognition
M1 - 108577
ER -