TY - JOUR
T1 - Efficient spatiotemporal context modeling for action recognition
AU - Cao, Congqi
AU - Lu, Yue
AU - Zhang, Yifan
AU - Jiang, Dongmei
AU - Zhang, Yanning
N1 - Publisher Copyright:
© 2023 Elsevier B.V.
PY - 2023/8/7
Y1 - 2023/8/7
AB - Contextual information is essential in action recognition. However, local operations have difficulty modeling relations between distant elements, while directly computing dense relations between every pair of points incurs a huge computation and memory burden. Inspired by the recurrent 2D criss-cross attention (RCCA-2D) used in image segmentation, we propose a recurrent 3D criss-cross attention (RCCA-3D) that factorizes the global relation map into sparse relation maps to model long-range spatiotemporal context at minor cost for video-based action recognition. Specifically, we first propose a 3D criss-cross attention (CCA-3D) module. Compared with CCA-2D, which operates only in space, it captures the spatiotemporal relationships between points on the same line along the width, height, and time directions. However, simply replacing the two CCA-2D modules in RCCA-2D with our CCA-3D modules cannot model the full spatiotemporal context of videos. Therefore, we further duplicate CCA-3D with a recurrent mechanism that propagates relations from points on a line to a plane and finally to the whole spatiotemporal volume. To adapt RCCA-3D to action recognition, we propose a novel recurrent structure rather than directly extending the original 2D structure to 3D. In the experiments, we thoroughly analyze different RCCA-3D structures, verifying that the proposed structure is more suitable for action recognition. We also compare RCCA-3D with non-local attention, showing that RCCA-3D requires 25% fewer parameters and 30% fewer FLOPs while achieving even higher accuracy. Finally, equipped with our RCCA-3D, three networks achieve improved and leading performance on five RGB-based and skeleton-based datasets.
KW - Action recognition
KW - Attention module
KW - Long-range context modeling
KW - Relation
KW - Spatiotemporal feature map
UR - http://www.scopus.com/inward/record.url?scp=85159374261&partnerID=8YFLogxK
U2 - 10.1016/j.neucom.2023.126289
DO - 10.1016/j.neucom.2023.126289
M3 - Article
AN - SCOPUS:85159374261
SN - 0925-2312
VL - 545
JO - Neurocomputing
JF - Neurocomputing
M1 - 126289
ER -