TY - JOUR
T1 - I2Net
T2 - Mining intra-video and inter-video attention for temporal action localization
AU - Zhang, Wei
AU - Wang, Binglu
AU - Ma, Songhui
AU - Zhang, Yani
AU - Zhao, Yongqiang
N1 - Publisher Copyright:
© 2021 Elsevier B.V.
PY - 2021/7/15
Y1 - 2021/7/15
N2 - This paper addresses two challenges facing the temporal action localization community: the lack of long-term relationships and action pattern uncertainty. The former prevents cooperation among multiple action instances within a video, while the latter may cause incomplete localizations or false positives. The lack of long-term relationships results from the limited receptive field. Instead of stacking multiple layers or using large convolution kernels, we propose an intra-video attention mechanism that brings a global receptive field to each temporal point. As for the action pattern uncertainty challenge, although it is hard to precisely depict the desired action pattern, paired videos that share the same action category can provide complementary information about it. Based on the intra-video and inter-video attention mechanisms, we propose a unified framework, named I2Net, to tackle the challenging temporal action localization task. Given two videos that share action categories, I2Net adopts the widely used one-stage action localization paradigm to process them in parallel. Between two neighboring layers within the same video, intra-video attention brings global information to each temporal point and helps learn representative features. Between two parallel layers across the two videos, inter-video attention introduces complementary information to each video and helps learn accurate action patterns. Through the cooperation of the intra-video and inter-video attention mechanisms, I2Net shows clear performance gains over the baseline and sets a new state of the art on two widely used benchmarks, THUMOS14 and ActivityNet v1.3.
AB - This paper addresses two challenges facing the temporal action localization community: the lack of long-term relationships and action pattern uncertainty. The former prevents cooperation among multiple action instances within a video, while the latter may cause incomplete localizations or false positives. The lack of long-term relationships results from the limited receptive field. Instead of stacking multiple layers or using large convolution kernels, we propose an intra-video attention mechanism that brings a global receptive field to each temporal point. As for the action pattern uncertainty challenge, although it is hard to precisely depict the desired action pattern, paired videos that share the same action category can provide complementary information about it. Based on the intra-video and inter-video attention mechanisms, we propose a unified framework, named I2Net, to tackle the challenging temporal action localization task. Given two videos that share action categories, I2Net adopts the widely used one-stage action localization paradigm to process them in parallel. Between two neighboring layers within the same video, intra-video attention brings global information to each temporal point and helps learn representative features. Between two parallel layers across the two videos, inter-video attention introduces complementary information to each video and helps learn accurate action patterns. Through the cooperation of the intra-video and inter-video attention mechanisms, I2Net shows clear performance gains over the baseline and sets a new state of the art on two widely used benchmarks, THUMOS14 and ActivityNet v1.3.
KW - Inter-video attention
KW - Intra-video attention
KW - Temporal action localization
UR - http://www.scopus.com/inward/record.url?scp=85104973374&partnerID=8YFLogxK
U2 - 10.1016/j.neucom.2021.02.085
DO - 10.1016/j.neucom.2021.02.085
M3 - Article
AN - SCOPUS:85104973374
SN - 0925-2312
VL - 444
SP - 16
EP - 29
JO - Neurocomputing
JF - Neurocomputing
ER -