TY - JOUR
T1 - I2Net
T2 - Mining intra-video and inter-video attention for temporal action localization
AU - Zhang, Wei
AU - Wang, Binglu
AU - Ma, Songhui
AU - Zhang, Yani
AU - Zhao, Yongqiang
N1 - Publisher Copyright:
© 2021 Elsevier B.V.
PY - 2021/7/15
Y1 - 2021/7/15
N2 - This paper addresses two challenges facing the temporal action localization community: the lack of long-term relationships and action pattern uncertainty. The former prevents cooperation among multiple action instances within a video, while the latter may cause incomplete localizations or false positives. The lack of long-term relationships results from the limited receptive field. Instead of stacking multiple layers or using large convolution kernels, we propose an intra-video attention mechanism that brings a global receptive field to each temporal point. As for the action pattern uncertainty challenge, although it is hard to precisely depict the desired action pattern, paired videos that share the same action category can provide complementary information about it. Based on the intra-video and inter-video attention mechanisms, we propose a unified framework, named I2Net, to tackle the challenging temporal action localization task. Given two videos that share action categories, I2Net adopts the widely used one-stage action localization paradigm to process them in parallel. Between two neighboring layers within the same video, intra-video attention brings global information to each temporal point and helps learn representative features. Between two parallel layers across the two videos, inter-video attention introduces complementary information to each video and helps learn accurate action patterns. Through the cooperation of the intra-video and inter-video attention mechanisms, I2Net shows clear performance gains over the baseline and sets a new state of the art on two widely used benchmarks, THUMOS14 and ActivityNet v1.3.
AB - This paper addresses two challenges facing the temporal action localization community: the lack of long-term relationships and action pattern uncertainty. The former prevents cooperation among multiple action instances within a video, while the latter may cause incomplete localizations or false positives. The lack of long-term relationships results from the limited receptive field. Instead of stacking multiple layers or using large convolution kernels, we propose an intra-video attention mechanism that brings a global receptive field to each temporal point. As for the action pattern uncertainty challenge, although it is hard to precisely depict the desired action pattern, paired videos that share the same action category can provide complementary information about it. Based on the intra-video and inter-video attention mechanisms, we propose a unified framework, named I2Net, to tackle the challenging temporal action localization task. Given two videos that share action categories, I2Net adopts the widely used one-stage action localization paradigm to process them in parallel. Between two neighboring layers within the same video, intra-video attention brings global information to each temporal point and helps learn representative features. Between two parallel layers across the two videos, inter-video attention introduces complementary information to each video and helps learn accurate action patterns. Through the cooperation of the intra-video and inter-video attention mechanisms, I2Net shows clear performance gains over the baseline and sets a new state of the art on two widely used benchmarks, THUMOS14 and ActivityNet v1.3.
KW - Inter-video attention
KW - Intra-video attention
KW - Temporal action localization
UR - http://www.scopus.com/inward/record.url?scp=85104973374&partnerID=8YFLogxK
U2 - 10.1016/j.neucom.2021.02.085
DO - 10.1016/j.neucom.2021.02.085
M3 - Article
AN - SCOPUS:85104973374
SN - 0925-2312
VL - 444
SP - 16
EP - 29
JO - Neurocomputing
JF - Neurocomputing
ER -