TY - JOUR
T1 - Structured Attention Composition for Temporal Action Localization
AU - Yang, Le
AU - Han, Junwei
AU - Zhao, Tao
AU - Liu, Nian
AU - Zhang, Dingwen
N1 - Publisher Copyright: IEEE
PY - 2022
Y1 - 2022
AB - Temporal action localization aims at localizing action instances in untrimmed videos. Existing works have designed various effective modules to precisely localize action instances based on appearance and motion features. However, by treating these two kinds of features with equal importance, previous works cannot take full advantage of each modality, leaving the learned model sub-optimal. To tackle this issue, we make an early effort to study temporal action localization from the perspective of multi-modality feature learning, based on the observation that different actions exhibit specific preferences for the appearance or motion modality. Specifically, we build a novel structured attention composition module. Unlike conventional attention, the proposed module does not infer frame attention and modality attention independently. Instead, by casting the relationship between the modality attention and the frame attention as an attention assignment process, the structured attention composition module learns to encode the frame-modality structure and uses it to regularize the inferred frame attention and modality attention, respectively, based on optimal transport theory. The final frame-modality attention is obtained by composing the two individual attentions. The proposed structured attention composition module can be deployed as a plug-and-play module into existing action localization frameworks. Extensive experiments on two widely used benchmarks show that the proposed structured attention composition consistently improves four state-of-the-art temporal action localization methods and establishes new state-of-the-art performance on THUMOS14.
KW - Costs
KW - Estimation
KW - Location awareness
KW - optimal transport
KW - Receivers
KW - Research and development
KW - structured attention composition
KW - Task analysis
KW - Temporal action localization
KW - Videos
UR - http://www.scopus.com/inward/record.url?scp=85132725808&partnerID=8YFLogxK
U2 - 10.1109/TIP.2022.3180925
DO - 10.1109/TIP.2022.3180925
M3 - Article
AN - SCOPUS:85132725808
SN - 1057-7149
SP - 1
JO - IEEE Transactions on Image Processing
JF - IEEE Transactions on Image Processing
ER -