VS-TransGRU: A Novel Transformer-GRU-Based Framework Enhanced by Visual-Semantic Fusion for Egocentric Action Anticipation

Congqi Cao; Ze Sun; Qinyi Lv; Lingtong Min; Yanning Zhang

doi:10.1109/TCSVT.2024.3425598

School of Computer Science

Northwestern Polytechnical University Xian

Research output: Contribution to journal › Article › peer-review

1 Scopus citation

Abstract

Egocentric action anticipation is a challenging task that aims to make advanced predictions of future actions from current and historical observations in the first-person view. Most existing methods focus on improving the model architecture and loss function based on the visual input and recurrent neural network to boost the anticipation performance. However, these methods, which merely consider visual information and rely on a single network architecture, gradually reach a performance plateau. In order to fully understand what has been observed and capture the dependencies between current observations and future actions well enough, we propose a novel visual-semantic fusion enhanced and Transformer-GRU-based action anticipation framework in this paper. Firstly, high-level semantic information is introduced to improve the performance of action anticipation for the first time. We propose to use the semantic features generated based on the class labels or directly from the visual observations to augment the original visual features. Secondly, to take advantage of both the parallel and autoregressive models, we design a Transformer-based encoder for long-term sequential modeling and a GRU-based decoder for flexible iteration decoding. This hybrid architecture allows for better performance with fewer parameters and computations. Thirdly, an effective visual-semantic fusion module is proposed to make up for the semantic gap and fully utilize the complementarity of different modalities. Extensive experiments on two large-scale first-person view datasets and two third-person datasets validate the effectiveness of our proposed method, which achieves new state-of-the-art performance, outperforming previous approaches by a large margin. The code will be released after acceptance at https://github.com/sunze992/VS-TransGRU.
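The pipeline outlined in the abstract (fuse visual and semantic features, encode the observed sequence in parallel, then decode future steps iteratively with a GRU) can be illustrated with a minimal, untrained sketch. It is not the authors' implementation: the abstract does not specify the fusion module, layer sizes, or losses, so the concatenation-based `fuse`, the single-head attention encoder, the bias-free GRU cell, and all names and dimensions below are assumptions chosen only for illustration.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    """Random weights; a trained model would learn these."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fuse(visual, semantic):
    # ASSUMPTION: plain concatenation stands in for the paper's fusion module.
    return visual + semantic


def self_attention(seq, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over the whole sequence."""
    Q = [matvec(Wq, x) for x in seq]
    K = [matvec(Wk, x) for x in seq]
    V = [matvec(Wv, x) for x in seq]
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(d)])
    return out


def gru_step(x, h, p):
    """One GRU update (biases omitted for brevity)."""
    xh = x + h
    z = [sigmoid(v) for v in matvec(p["Wz"], xh)]  # update gate
    r = [sigmoid(v) for v in matvec(p["Wr"], xh)]  # reset gate
    cand_in = x + [ri * hi for ri, hi in zip(r, h)]
    cand = [math.tanh(v) for v in matvec(p["Wh"], cand_in)]
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


def anticipate(visual_seq, semantic_seq, params, steps):
    """Encode fused observations, then decode `steps` future class distributions."""
    fused = [fuse(v, s) for v, s in zip(visual_seq, semantic_seq)]
    encoded = self_attention(fused, params["Wq"], params["Wk"], params["Wv"])
    h = encoded[-1]  # ASSUMPTION: last encoded step initializes the decoder
    x = encoded[-1]
    outputs = []
    for _ in range(steps):
        h = gru_step(x, h, params["gru"])
        outputs.append(softmax(matvec(params["Wout"], h)))
        x = h  # feed the new state back in as the next input
    return outputs


rng = random.Random(0)
dv, ds, d, n_classes = 4, 3, 5, 6  # hypothetical feature and class sizes
params = {
    "Wq": rand_matrix(d, dv + ds, rng),
    "Wk": rand_matrix(d, dv + ds, rng),
    "Wv": rand_matrix(d, dv + ds, rng),
    "gru": {name: rand_matrix(d, 2 * d, rng) for name in ("Wz", "Wr", "Wh")},
    "Wout": rand_matrix(n_classes, d, rng),
}
visual_seq = [[rng.random() for _ in range(dv)] for _ in range(8)]
semantic_seq = [[rng.random() for _ in range(ds)] for _ in range(8)]
preds = anticipate(visual_seq, semantic_seq, params, steps=3)
```

Here `preds` holds one probability distribution over the hypothetical action classes per anticipated step. The encoder sees all observed frames at once, while the decoder can be unrolled for as many future steps as needed, which is the division of labor the abstract attributes to the hybrid design.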

Original language	English
Pages (from-to)	11605-11618
Number of pages	14
Journal	IEEE Transactions on Circuits and Systems for Video Technology
Volume	34
Issue number	11
DOIs	https://doi.org/10.1109/TCSVT.2024.3425598
State	Published - 2024

Keywords

Egocentric action anticipation
GRU
semantic
transformer
visual-semantic fusion

Cite this

@article{ef19e2bfc6964afd846b4fcfe4352601,

title = "VS-TransGRU: A Novel Transformer-GRU-Based Framework Enhanced by Visual-Semantic Fusion for Egocentric Action Anticipation",

abstract = "Egocentric action anticipation is a challenging task that aims to make advanced predictions of future actions from current and historical observations in the first-person view. Most existing methods focus on improving the model architecture and loss function based on the visual input and recurrent neural network to boost the anticipation performance. However, these methods, which merely consider visual information and rely on a single network architecture, gradually reach a performance plateau. In order to fully understand what has been observed and capture the dependencies between current observations and future actions well enough, we propose a novel visual-semantic fusion enhanced and Transformer-GRU-based action anticipation framework in this paper. Firstly, high-level semantic information is introduced to improve the performance of action anticipation for the first time. We propose to use the semantic features generated based on the class labels or directly from the visual observations to augment the original visual features. Secondly, to take advantage of both the parallel and autoregressive models, we design a Transformer-based encoder for long-term sequential modeling and a GRU-based decoder for flexible iteration decoding. This hybrid architecture allows for better performance with fewer parameters and computations. Thirdly, an effective visual-semantic fusion module is proposed to make up for the semantic gap and fully utilize the complementarity of different modalities. Extensive experiments on two large-scale first-person view datasets and two third-person datasets validate the effectiveness of our proposed method, which achieves new state-of-the-art performance, outperforming previous approaches by a large margin. The code will be released after acceptance at https://github.com/sunze992/VS-TransGRU.",

keywords = "Egocentric action anticipation, GRU, semantic, transformer, visual-semantic fusion",

author = "Congqi Cao and Ze Sun and Qinyi Lv and Lingtong Min and Yanning Zhang",

note = "Publisher Copyright: {\textcopyright} 2024 IEEE.",

year = "2024",

doi = "10.1109/TCSVT.2024.3425598",

language = "English",

volume = "34",

pages = "11605--11618",

journal = "IEEE Transactions on Circuits and Systems for Video Technology",

issn = "1051-8215",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

number = "11",

}

TY - JOUR

T1 - VS-TransGRU

T2 - A Novel Transformer-GRU-Based Framework Enhanced by Visual-Semantic Fusion for Egocentric Action Anticipation

AU - Cao, Congqi

AU - Sun, Ze

AU - Lv, Qinyi

AU - Min, Lingtong

AU - Zhang, Yanning

PY - 2024

Y1 - 2024

AB - Egocentric action anticipation is a challenging task that aims to make advanced predictions of future actions from current and historical observations in the first-person view. Most existing methods focus on improving the model architecture and loss function based on the visual input and recurrent neural network to boost the anticipation performance. However, these methods, which merely consider visual information and rely on a single network architecture, gradually reach a performance plateau. In order to fully understand what has been observed and capture the dependencies between current observations and future actions well enough, we propose a novel visual-semantic fusion enhanced and Transformer-GRU-based action anticipation framework in this paper. Firstly, high-level semantic information is introduced to improve the performance of action anticipation for the first time. We propose to use the semantic features generated based on the class labels or directly from the visual observations to augment the original visual features. Secondly, to take advantage of both the parallel and autoregressive models, we design a Transformer-based encoder for long-term sequential modeling and a GRU-based decoder for flexible iteration decoding. This hybrid architecture allows for better performance with fewer parameters and computations. Thirdly, an effective visual-semantic fusion module is proposed to make up for the semantic gap and fully utilize the complementarity of different modalities. Extensive experiments on two large-scale first-person view datasets and two third-person datasets validate the effectiveness of our proposed method, which achieves new state-of-the-art performance, outperforming previous approaches by a large margin. The code will be released after acceptance at https://github.com/sunze992/VS-TransGRU.

KW - Egocentric action anticipation

KW - GRU

KW - semantic

KW - transformer

KW - visual-semantic fusion

UR - http://www.scopus.com/inward/record.url?scp=85198316769&partnerID=8YFLogxK

U2 - 10.1109/TCSVT.2024.3425598

DO - 10.1109/TCSVT.2024.3425598

M3 - Article

AN - SCOPUS:85198316769

SN - 1051-8215

VL - 34

SP - 11605

EP - 11618

JO - IEEE Transactions on Circuits and Systems for Video Technology

JF - IEEE Transactions on Circuits and Systems for Video Technology

IS - 11

ER -
