TY - JOUR
T1 - POLO: Learning Explicit Cross-Modality Fusion for Temporal Action Localization
T2 - IEEE Signal Processing Letters
AU - Wang, Binglu
AU - Yang, Le
AU - Zhao, Yongqiang
N1 - Publisher Copyright:
© 1994-2012 IEEE.
PY - 2021
Y1 - 2021
N2 - Temporal action localization aims at discovering action instances in untrimmed videos, where RGB and flow are two widely used feature modalities: RGB chiefly reveals appearance, while flow mainly depicts motion. Given RGB and flow features, previous methods employ either the early fusion or the late fusion paradigm to mine the complementarity between them. By concatenating raw RGB and flow features, early fusion achieves complementarity implicitly through the network, but it partly discards the particularity of each modality. Late fusion maintains two independent branches to explore the particularity of each modality, but it fuses only the localization results, which is insufficient to mine the complementarity. In this work, we propose explicit cross-modality fusion (POLO) to effectively utilize the complementarity between the two modalities and thoroughly explore the particularity of each. POLO performs cross-modality fusion by estimating an attention weight from the RGB modality and applying it to the flow modality (and vice versa), bridging the complementarity of one modality to supply the other. Assisted by the attention weights, POLO learns from RGB and flow features independently and explores the particularity of each modality. Extensive experiments on two benchmarks demonstrate the preferable performance of POLO.
KW - Feature fusion
KW - frame-wise attention
KW - mutual attention
KW - temporal action localization
UR - http://www.scopus.com/inward/record.url?scp=85101787291&partnerID=8YFLogxK
U2 - 10.1109/LSP.2021.3061289
DO - 10.1109/LSP.2021.3061289
M3 - Article
AN - SCOPUS:85101787291
SN - 1070-9908
VL - 28
SP - 503
EP - 507
JO - IEEE Signal Processing Letters
JF - IEEE Signal Processing Letters
M1 - 9362259
ER -