TY - JOUR
T1 - Structured Attention Composition for Temporal Action Localization
AU - Yang, Le
AU - Han, Junwei
AU - Zhao, Tao
AU - Liu, Nian
AU - Zhang, Dingwen
N1 - Publisher Copyright: IEEE
PY - 2022
Y1 - 2022
AB - Temporal action localization aims at localizing action instances in untrimmed videos. Existing works have designed various effective modules to precisely localize action instances based on appearance and motion features. However, by treating these two kinds of features with equal importance, previous works cannot take full advantage of each modality, leaving the learned model sub-optimal. To tackle this issue, we make an early effort to study temporal action localization from the perspective of multi-modality feature learning, based on the observation that different actions exhibit specific preferences for the appearance or motion modality. Specifically, we build a novel structured attention composition module. Unlike conventional attention, the proposed module does not infer frame attention and modality attention independently. Instead, by casting the relationship between the modality attention and the frame attention as an attention assignment process, the structured attention composition module learns to encode the frame-modality structure and uses it to regularize the inferred frame attention and modality attention, respectively, based on optimal transport theory. The final frame-modality attention is obtained by composing the two individual attentions. The proposed structured attention composition module can be deployed as a plug-and-play module into existing action localization frameworks. Extensive experiments on two widely used benchmarks show that the proposed structured attention composition consistently improves four state-of-the-art temporal action localization methods and establishes new state-of-the-art performance on THUMOS14.
KW - Costs
KW - Estimation
KW - Location awareness
KW - optimal transport
KW - Receivers
KW - Research and development
KW - structured attention composition
KW - Task analysis
KW - Temporal action localization
KW - Videos
UR - http://www.scopus.com/inward/record.url?scp=85132725808&partnerID=8YFLogxK
U2 - 10.1109/TIP.2022.3180925
DO - 10.1109/TIP.2022.3180925
M3 - Article
AN - SCOPUS:85132725808
SN - 1057-7149
SP - 1
JO - IEEE Transactions on Image Processing
JF - IEEE Transactions on Image Processing
ER -