TY - JOUR
T1 - VS-TransGRU
T2 - A Novel Transformer-GRU-Based Framework Enhanced by Visual-Semantic Fusion for Egocentric Action Anticipation
AU - Cao, Congqi
AU - Sun, Ze
AU - Lv, Qinyi
AU - Min, Lingtong
AU - Zhang, Yanning
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024
Y1 - 2024
AB - Egocentric action anticipation is a challenging task that aims to predict future actions in advance from current and historical observations in the first-person view. Most existing methods boost anticipation performance by improving the model architecture and loss function on top of visual input and a recurrent neural network. However, these methods, which consider only visual information and rely on a single network architecture, have gradually reached a performance plateau. To fully understand what has been observed and to capture the dependencies between current observations and future actions, we propose a novel Transformer-GRU-based action anticipation framework enhanced by visual-semantic fusion. First, high-level semantic information is introduced, for the first time, to improve action anticipation performance: we use semantic features generated from the class labels or directly from the visual observations to augment the original visual features. Second, to take advantage of both parallel and autoregressive models, we design a Transformer-based encoder for long-term sequential modeling and a GRU-based decoder for flexible iterative decoding. This hybrid architecture achieves better performance with fewer parameters and less computation. Third, an effective visual-semantic fusion module is proposed to bridge the semantic gap and fully exploit the complementarity of the different modalities. Extensive experiments on two large-scale first-person view datasets and two third-person datasets validate the effectiveness of the proposed method, which achieves new state-of-the-art performance and outperforms previous approaches by a large margin. The code will be released after acceptance at https://github.com/sunze992/VS-TransGRU.
KW - Egocentric action anticipation
KW - GRU
KW - semantic
KW - transformer
KW - visual-semantic fusion
UR - http://www.scopus.com/inward/record.url?scp=85198316769&partnerID=8YFLogxK
U2 - 10.1109/TCSVT.2024.3425598
DO - 10.1109/TCSVT.2024.3425598
M3 - Article
AN - SCOPUS:85198316769
SN - 1051-8215
VL - 34
SP - 11605
EP - 11618
JO - IEEE Transactions on Circuits and Systems for Video Technology
JF - IEEE Transactions on Circuits and Systems for Video Technology
IS - 11
ER -