I2Net: Mining intra-video and inter-video attention for temporal action localization

Wei Zhang, Binglu Wang, Songhui Ma, Yani Zhang, Yongqiang Zhao

Research output: Contribution to journal › Article › peer-review

17 Scopus citations

Abstract

This paper addresses two challenges facing the temporal action localization community: the lack of long-term relationships and action pattern uncertainty. The former prevents cooperation among multiple action instances within a video, while the latter may cause incomplete localizations or false positives. The lack of long-term relationships stems from the limited receptive field. Instead of stacking multiple layers or using large convolution kernels, we propose an intra-video attention mechanism that brings a global receptive field to each temporal point. As for action pattern uncertainty, although it is hard to depict the desired action pattern precisely, paired videos that share the same action category can provide complementary information about it. Consequently, we propose an inter-video attention mechanism to assist in learning accurate action patterns. Based on the intra-video and inter-video attention, we propose a unified framework, namely I2Net, to tackle the challenging temporal action localization task. Given two videos that share an action category, I2Net adopts the widely used one-stage action localization paradigm to process them in parallel. Between two neighboring layers within the same video, intra-video attention brings global information to each temporal point and helps learn representative features. Between two parallel layers across the two videos, inter-video attention introduces complementary information to each video and helps learn accurate action patterns. With the cooperation of the intra-video and inter-video attention mechanisms, I2Net shows clear performance gains over the baseline and sets a new state of the art on two widely used benchmarks, i.e., THUMOS14 and ActivityNet v1.3.
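The two mechanisms described above can both be viewed as scaled dot-product attention, differing only in where the keys and values come from: the same video (intra-video) or the paired video (inter-video). The sketch below illustrates this distinction with plain NumPy; the function and variable names, feature dimensions, and sequence lengths are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(query, context):
    # Scaled dot-product attention: every temporal point in `query`
    # attends to every temporal point in `context`, giving each point
    # a global receptive field over `context`.
    d = query.shape[-1]
    scores = query @ context.T / np.sqrt(d)    # (T_q, T_c) affinities
    return softmax(scores, axis=-1) @ context  # (T_q, d) aggregated features

rng = np.random.default_rng(0)
video_a = rng.standard_normal((100, 64))  # 100 temporal points, 64-dim features
video_b = rng.standard_normal((80, 64))   # paired video, same action category

# Intra-video attention: self-attention within video A, so long-range
# temporal relationships are captured without stacking layers.
intra_a = attention(video_a, video_a)     # (100, 64)

# Inter-video attention: A queries B, pulling in complementary
# information about the shared action pattern.
inter_a = attention(video_a, video_b)     # (100, 64)
```

In I2Net these two forms of attention cooperate inside a one-stage localization network; here they are shown in isolation to make the query/context asymmetry explicit.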

Original language: English
Pages (from-to): 16-29
Number of pages: 14
Journal: Neurocomputing
Volume: 444
State: Published - 15 Jul 2021

Keywords

  • Inter-video attention
  • Intra-video attention
  • Temporal action localization

