Video summarization with a dual-path attentive network

Guoqiang Liang, Yanbing Lv, Shucheng Li, Xiahong Wang, Yanning Zhang

Research output: Contribution to journal › Article › peer-review

21 Citations (Scopus)

Abstract

With the explosive growth of videos captured every day, efficiently extracting useful information from them has become an increasingly important problem. Video summarization, which aims to extract the most important frames or shots, has attracted increasing interest as one of the most effective approaches. Most current methods employ a recurrent structure; however, its step-by-step nature makes such models difficult to parallelize. To address this problem, we propose a dual-path attentive video summarization framework consisting of a temporal-spatial encoder, a score-aware encoder, and a decoder, all built mainly on multi-head self-attention and the convolutional block attention module. The temporal-spatial encoder captures temporal and spatial information, while the score-aware encoder combines appearance features with previously predicted frame-level importance scores. By combining scores and appearance features, our model better captures long-range global dependencies and continuously updates the importance scores of previous frames. Moreover, because it is based entirely on the attention mechanism, our model can be trained in full parallel, which reduces training time. To validate the method, we evaluate it on the two popular datasets SumMe and TVSum. The experimental results show the effectiveness of the proposed method.
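The parallelism claim above rests on replacing recurrence with self-attention: every frame attends to every other frame in a single matrix operation, so no step-by-step loop over time is needed. The paper does not give implementation details, so the following is only a minimal NumPy sketch of standard multi-head self-attention over a sequence of frame features (the shapes, weight names, and head count are illustrative assumptions, not the authors' code):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """Minimal multi-head self-attention over frame features.

    X: (T, d_model) array, one row per video frame.
    Wq, Wk, Wv, Wo: (d_model, d_model) projection matrices (illustrative).
    All T frames are processed at once -- no recurrence over time.
    """
    T, d_model = X.shape
    d_head = d_model // num_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv

    def split(M):
        # (T, d_model) -> (num_heads, T, d_head)
        return M.reshape(T, num_heads, d_head).transpose(1, 0, 2)

    Qh, Kh, Vh = split(Q), split(K), split(V)
    # Scaled dot-product attention: (num_heads, T, T) score matrix,
    # letting every frame attend to every other frame in parallel.
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)
    attn = softmax(scores, axis=-1)
    out = attn @ Vh                                  # (num_heads, T, d_head)
    out = out.transpose(1, 0, 2).reshape(T, d_model)  # merge heads
    return out @ Wo

# Usage: 6 frames with 8-dim features and 2 heads (arbitrary toy sizes).
rng = np.random.default_rng(0)
T, d_model = 6, 8
X = rng.normal(size=(T, d_model))
Ws = [rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4)]
Y = multi_head_self_attention(X, *Ws, num_heads=2)
print(Y.shape)
```

Because the whole (T, T) attention matrix is computed in one shot, the per-frame outputs have no sequential dependency, which is what allows fully parallel training in contrast to an RNN's step-by-step updates.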

Original language: English
Pages (from-to): 1-9
Number of pages: 9
Journal: Neurocomputing
Volume: 467
DOI
Publication status: Published - 7 Jan 2022
