TY - GEN
T1 - Hand Action Recognition from RGB-D Egocentric Videos in Substations Operations and Maintenance
AU - Yao, Yiyang
AU - Wang, Xue
AU - Zhou, Guoqing
AU - Wang, Qing
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - This paper proposes a novel multimodal fusion network (MRDFNet) for egocentric hand action recognition from RGB-D videos. First, we use three separate streams to extract individual spatio-temporal features for the different modalities: RGB frames, optical flow stacks, and depth frames. In particular, for the RGB and depth streams, an Attention-based Bidirectional Long Short-Term Memory network (Bi-LSTA) is used to identify regions of interest both spatially and temporally. The extracted features are then fed into a fusion module to obtain an integrated feature, which is finally used for egocentric hand action recognition. The fusion module learns complementary information from the multiple modalities, i.e., it preserves the distinctive properties of each modality while exploring the properties shared across modalities. Experimental results on both the self-collected RGB-D Egocentric Manual Operation Dataset in Electrical Substations (REMOD-ES) and the THU-READ dataset of daily-life actions show the superiority of the proposed approach over state-of-the-art methods.
AB - This paper proposes a novel multimodal fusion network (MRDFNet) for egocentric hand action recognition from RGB-D videos. First, we use three separate streams to extract individual spatio-temporal features for the different modalities: RGB frames, optical flow stacks, and depth frames. In particular, for the RGB and depth streams, an Attention-based Bidirectional Long Short-Term Memory network (Bi-LSTA) is used to identify regions of interest both spatially and temporally. The extracted features are then fed into a fusion module to obtain an integrated feature, which is finally used for egocentric hand action recognition. The fusion module learns complementary information from the multiple modalities, i.e., it preserves the distinctive properties of each modality while exploring the properties shared across modalities. Experimental results on both the self-collected RGB-D Egocentric Manual Operation Dataset in Electrical Substations (REMOD-ES) and the THU-READ dataset of daily-life actions show the superiority of the proposed approach over state-of-the-art methods.
KW - attention mechanism
KW - egocentric video
KW - hand action recognition
KW - human-object interaction
KW - multimodal data
UR - http://www.scopus.com/inward/record.url?scp=85185765388&partnerID=8YFLogxK
U2 - 10.1109/ETFG55873.2023.10408532
DO - 10.1109/ETFG55873.2023.10408532
M3 - Conference contribution
AN - SCOPUS:85185765388
T3 - 2023 IEEE International Conference on Energy Technologies for Future Grids, ETFG 2023
BT - 2023 IEEE International Conference on Energy Technologies for Future Grids, ETFG 2023
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2023 IEEE International Conference on Energy Technologies for Future Grids, ETFG 2023
Y2 - 3 December 2023 through 6 December 2023
ER -