Multi-scale deep feature fusion based sparse dictionary selection for video summarization

Xiao Wu; Mingyang Ma; Shuai Wan; Xiuxiu Han; Shaohui Mei

doi:10.1016/j.image.2023.117006

Multi-scale deep feature fusion based sparse dictionary selection for video summarization

Xiao Wu, Mingyang Ma, Shuai Wan, Xiuxiu Han, Shaohui Mei

电子信息学院

科研成果: 期刊稿件 › 文章 › 同行评审

2 引用（Scopus）

摘要

The explosive growth of video data constitutes a series of new challenges in computer vision, and the function of video summarization (VS) is becoming more and more prominent. Recent works have shown the effectiveness of sparse dictionary selection (SDS) based VS, which selects a representative frame set to sufficiently reconstruct a given video. Existing SDS based VS methods use conventional handcrafted features or single-scale deep features, which could diminish their summarization performance due to the underutilization of frame feature representation. Deep learning techniques based on convolutional neural networks (CNNs) exhibit powerful capabilities among various vision tasks, as the CNN provides excellent feature representation. Therefore, in this paper, a multi-scale deep feature fusion based sparse dictionary selection (MSDFF-SDS) is proposed for VS. Specifically, multi-scale features include the directly extracted features from the last fully connected layer and the global average pooling (GAP) processed features from intermediate layers, then VS is formulated as a problem of minimizing the reconstruction error using the multi-scale deep feature fusion. In our formulation, the contribution of each scale of features can be adjusted by a balance parameter, and the row-sparsity consistency of the simultaneous reconstruction coefficient is used to select as few keyframes as possible. The resulting MSDFF-SDS model is solved by using an efficient greedy pursuit algorithm. Experimental results on two benchmark datasets demonstrate that the proposed MSDFF-SDS improves the F-score of keyframe based summarization more than 3% compared with the existing SDS methods, and performs better than most deep-learning methods for skimming based summarization.

源语言	英语
文章编号	117006
期刊	Signal Processing: Image Communication
卷	118
DOI	https://doi.org/10.1016/j.image.2023.117006
出版状态	已出版 - 10月 2023

访问文件

10.1016/j.image.2023.117006

其它文件与链接

链接到 Scopus 的出版物

引用此

@article{361f1fe7421b4877a0160a01938328fa,

title = "Multi-scale deep feature fusion based sparse dictionary selection for video summarization",

abstract = "The explosive growth of video data constitutes a series of new challenges in computer vision, and the function of video summarization (VS) is becoming more and more prominent. Recent works have shown the effectiveness of sparse dictionary selection (SDS) based VS, which selects a representative frame set to sufficiently reconstruct a given video. Existing SDS based VS methods use conventional handcrafted features or single-scale deep features, which could diminish their summarization performance due to the underutilization of frame feature representation. Deep learning techniques based on convolutional neural networks (CNNs) exhibit powerful capabilities among various vision tasks, as the CNN provides excellent feature representation. Therefore, in this paper, a multi-scale deep feature fusion based sparse dictionary selection (MSDFF-SDS) is proposed for VS. Specifically, multi-scale features include the directly extracted features from the last fully connected layer and the global average pooling (GAP) processed features from intermediate layers, then VS is formulated as a problem of minimizing the reconstruction error using the multi-scale deep feature fusion. In our formulation, the contribution of each scale of features can be adjusted by a balance parameter, and the row-sparsity consistency of the simultaneous reconstruction coefficient is used to select as few keyframes as possible. The resulting MSDFF-SDS model is solved by using an efficient greedy pursuit algorithm. Experimental results on two benchmark datasets demonstrate that the proposed MSDFF-SDS improves the F-score of keyframe based summarization more than 3% compared with the existing SDS methods, and performs better than most deep-learning methods for skimming based summarization.",

keywords = "Dictionary selection, Keyframe, Multi-scale, Sparse coding, Video summarization",

author = "Xiao Wu and Mingyang Ma and Shuai Wan and Xiuxiu Han and Shaohui Mei",

note = "Publisher Copyright: {\textcopyright} 2023 Elsevier B.V.",

year = "2023",

month = oct,

doi = "10.1016/j.image.2023.117006",

language = "英语",

volume = "118",

journal = "Signal Processing: Image Communication",

issn = "0923-5965",

publisher = "Elsevier B.V.",

}

TY - JOUR

T1 - Multi-scale deep feature fusion based sparse dictionary selection for video summarization

AU - Wu, Xiao

AU - Ma, Mingyang

AU - Wan, Shuai

AU - Han, Xiuxiu

AU - Mei, Shaohui

PY - 2023/10

Y1 - 2023/10

N2 - The explosive growth of video data constitutes a series of new challenges in computer vision, and the function of video summarization (VS) is becoming more and more prominent. Recent works have shown the effectiveness of sparse dictionary selection (SDS) based VS, which selects a representative frame set to sufficiently reconstruct a given video. Existing SDS based VS methods use conventional handcrafted features or single-scale deep features, which could diminish their summarization performance due to the underutilization of frame feature representation. Deep learning techniques based on convolutional neural networks (CNNs) exhibit powerful capabilities among various vision tasks, as the CNN provides excellent feature representation. Therefore, in this paper, a multi-scale deep feature fusion based sparse dictionary selection (MSDFF-SDS) is proposed for VS. Specifically, multi-scale features include the directly extracted features from the last fully connected layer and the global average pooling (GAP) processed features from intermediate layers, then VS is formulated as a problem of minimizing the reconstruction error using the multi-scale deep feature fusion. In our formulation, the contribution of each scale of features can be adjusted by a balance parameter, and the row-sparsity consistency of the simultaneous reconstruction coefficient is used to select as few keyframes as possible. The resulting MSDFF-SDS model is solved by using an efficient greedy pursuit algorithm. Experimental results on two benchmark datasets demonstrate that the proposed MSDFF-SDS improves the F-score of keyframe based summarization more than 3% compared with the existing SDS methods, and performs better than most deep-learning methods for skimming based summarization.

AB - The explosive growth of video data constitutes a series of new challenges in computer vision, and the function of video summarization (VS) is becoming more and more prominent. Recent works have shown the effectiveness of sparse dictionary selection (SDS) based VS, which selects a representative frame set to sufficiently reconstruct a given video. Existing SDS based VS methods use conventional handcrafted features or single-scale deep features, which could diminish their summarization performance due to the underutilization of frame feature representation. Deep learning techniques based on convolutional neural networks (CNNs) exhibit powerful capabilities among various vision tasks, as the CNN provides excellent feature representation. Therefore, in this paper, a multi-scale deep feature fusion based sparse dictionary selection (MSDFF-SDS) is proposed for VS. Specifically, multi-scale features include the directly extracted features from the last fully connected layer and the global average pooling (GAP) processed features from intermediate layers, then VS is formulated as a problem of minimizing the reconstruction error using the multi-scale deep feature fusion. In our formulation, the contribution of each scale of features can be adjusted by a balance parameter, and the row-sparsity consistency of the simultaneous reconstruction coefficient is used to select as few keyframes as possible. The resulting MSDFF-SDS model is solved by using an efficient greedy pursuit algorithm. Experimental results on two benchmark datasets demonstrate that the proposed MSDFF-SDS improves the F-score of keyframe based summarization more than 3% compared with the existing SDS methods, and performs better than most deep-learning methods for skimming based summarization.

KW - Dictionary selection

KW - Keyframe

KW - Multi-scale

KW - Sparse coding

KW - Video summarization

UR - http://www.scopus.com/inward/record.url?scp=85165964765&partnerID=8YFLogxK

U2 - 10.1016/j.image.2023.117006

DO - 10.1016/j.image.2023.117006

M3 - 文章

AN - SCOPUS:85165964765

SN - 0923-5965

VL - 118

JO - Signal Processing: Image Communication

JF - Signal Processing: Image Communication

M1 - 117006

ER -

Multi-scale deep feature fusion based sparse dictionary selection for video summarization

摘要

访问文件

其它文件与链接

指纹

引用此