TY - JOUR
T1 - POLO: Learning Explicit Cross-Modality Fusion for Temporal Action Localization
T2 - IEEE Signal Processing Letters
AU - Wang, Binglu
AU - Yang, Le
AU - Zhao, Yongqiang
N1 - Publisher Copyright:
© 1994-2012 IEEE.
PY - 2021
Y1 - 2021
N2 - Temporal action localization aims at discovering action instances in untrimmed videos, where RGB and flow are two widely used feature modalities: RGB chiefly reveals appearance, while flow mainly depicts motion. Given RGB and flow features, previous methods employ either the early fusion or the late fusion paradigm to mine the complementarity between them. By concatenating raw RGB and flow features, early fusion achieves complementarity implicitly through the network, but it partly discards the particularity of each modality. Late fusion maintains two independent branches to explore the particularity of each modality, but it fuses only the localization results, which is insufficient to mine the complementarity. In this work, we propose explicit cross-modality fusion (POLO) to effectively utilize the complementarity between the two modalities and thoroughly explore the particularity of each. POLO performs cross-modality fusion by estimating an attention weight from the RGB modality and applying it to the flow modality (and vice versa), bridging the complementarity of one modality to supply the other. Assisted by the attention weights, POLO learns from RGB and flow features independently and explores the particularity of each modality. Extensive experiments on two benchmarks demonstrate the preferable performance of POLO.
KW - Feature fusion
KW - frame-wise attention
KW - mutual attention
KW - temporal action localization
UR - http://www.scopus.com/inward/record.url?scp=85101787291&partnerID=8YFLogxK
U2 - 10.1109/LSP.2021.3061289
DO - 10.1109/LSP.2021.3061289
M3 - Article
AN - SCOPUS:85101787291
SN - 1070-9908
VL - 28
SP - 503
EP - 507
JO - IEEE Signal Processing Letters
JF - IEEE Signal Processing Letters
M1 - 9362259
ER -