TY - JOUR
T1 - One-shot Video Graph Generation for Explainable Action Reasoning
AU - Han, Yamin
AU - Zhuo, Tao
AU - Zhang, Peng
AU - Huang, Wei
AU - Zha, Yufei
AU - Zhang, Yanning
AU - Kankanhalli, Mohan
N1 - Publisher Copyright:
© 2022
PY - 2022/6/1
Y1 - 2022/6/1
N2 - Human action analysis is a critical yet challenging task for understanding diverse video content. Recently, to enable explainable reasoning of video actions, a spatio-temporal video graph structure was proposed to represent video state changes at the semantic level. However, its requirement of tedious manual annotation of all video frames is a serious limitation; the approach would be far more widely applicable if the video graph generation process could be automated. In this paper, a One-Shot Video Graph (OSVG) generation approach is proposed for more effective explainable action reasoning, which requires only a one-time annotation of the objects in the starting frame of the video. We first locate the predefined relevant objects across the temporal dimension with a proposed one-shot target-aware tracking strategy, which simultaneously obtains the object locations and links the objects across all video frames. Then, the scene graph of each video frame is constructed by an attribute detector and a relationship detector based on the estimated object locations. In addition, to further improve the reasoning accuracy of the performed actions, a video graph smoothing mechanism is designed with a fully-connected Conditional Random Field (CRF). By sequentially examining every state transition (including attributes and relationships) of the smoothed video graph, the occurring actions can be recognized with pre-defined rules. Experiments on the CAD-120++ dataset and a newly collected NTU RGBD++ dataset verify that the proposed OSVG outperforms other state-of-the-art video action reasoning strategies in both state recognition and action recognition accuracy.
AB - Human action analysis is a critical yet challenging task for understanding diverse video content. Recently, to enable explainable reasoning of video actions, a spatio-temporal video graph structure was proposed to represent video state changes at the semantic level. However, its requirement of tedious manual annotation of all video frames is a serious limitation; the approach would be far more widely applicable if the video graph generation process could be automated. In this paper, a One-Shot Video Graph (OSVG) generation approach is proposed for more effective explainable action reasoning, which requires only a one-time annotation of the objects in the starting frame of the video. We first locate the predefined relevant objects across the temporal dimension with a proposed one-shot target-aware tracking strategy, which simultaneously obtains the object locations and links the objects across all video frames. Then, the scene graph of each video frame is constructed by an attribute detector and a relationship detector based on the estimated object locations. In addition, to further improve the reasoning accuracy of the performed actions, a video graph smoothing mechanism is designed with a fully-connected Conditional Random Field (CRF). By sequentially examining every state transition (including attributes and relationships) of the smoothed video graph, the occurring actions can be recognized with pre-defined rules. Experiments on the CAD-120++ dataset and a newly collected NTU RGBD++ dataset verify that the proposed OSVG outperforms other state-of-the-art video action reasoning strategies in both state recognition and action recognition accuracy.
KW - Explainable action reasoning
KW - Spatial-temporal scene graphs
KW - State transition
UR - http://www.scopus.com/inward/record.url?scp=85125959857&partnerID=8YFLogxK
U2 - 10.1016/j.neucom.2022.02.069
DO - 10.1016/j.neucom.2022.02.069
M3 - Article
AN - SCOPUS:85125959857
SN - 0925-2312
VL - 488
SP - 212
EP - 225
JO - Neurocomputing
JF - Neurocomputing
ER -