TY - JOUR
T1 - One-shot Video Graph Generation for Explainable Action Reasoning
AU - Han, Yamin
AU - Zhuo, Tao
AU - Zhang, Peng
AU - Huang, Wei
AU - Zha, Yufei
AU - Zhang, Yanning
AU - Kankanhalli, Mohan
N1 - Publisher Copyright:
© 2022
PY - 2022/6/1
Y1 - 2022/6/1
N2 - Human action analysis is a critical yet challenging task for understanding diverse video content. Recently, to enable explainable reasoning of video actions, a spatio-temporal video graph structure was proposed to represent video state changes at the semantic level. However, its requirement of tedious manual annotation of all video frames is a serious limitation; the approach would be far more widely applicable if the video graph generation process could be automated. In this paper, a One-Shot Video Graph (OSVG) generation approach is proposed for more effective explainable action reasoning, which requires only a one-time annotation of the objects in the starting frame of the video. We first locate the predefined relevant objects across the temporal dimension with a proposed one-shot target-aware tracking strategy, which simultaneously obtains the object locations and links the objects across all video frames. Then, the scene graph of each video frame is constructed by an attribute detector and a relationship detector based on the estimated object locations. In addition, to further improve the reasoning accuracy of the performed actions, a video graph smoothing mechanism is designed with a fully-connected Conditional Random Field (CRF). By sequentially examining every state transition (including attributes and relationships) of the smoothed video graph, the occurring actions can be recognized with pre-defined rules. Experiments on the CAD-120++ dataset and a newly collected NTU RGBD++ dataset verify that the proposed OSVG outperforms other state-of-the-art video action reasoning strategies in both state recognition and action recognition accuracy.
AB - Human action analysis is a critical yet challenging task for understanding diverse video content. Recently, to enable explainable reasoning of video actions, a spatio-temporal video graph structure was proposed to represent video state changes at the semantic level. However, its requirement of tedious manual annotation of all video frames is a serious limitation; the approach would be far more widely applicable if the video graph generation process could be automated. In this paper, a One-Shot Video Graph (OSVG) generation approach is proposed for more effective explainable action reasoning, which requires only a one-time annotation of the objects in the starting frame of the video. We first locate the predefined relevant objects across the temporal dimension with a proposed one-shot target-aware tracking strategy, which simultaneously obtains the object locations and links the objects across all video frames. Then, the scene graph of each video frame is constructed by an attribute detector and a relationship detector based on the estimated object locations. In addition, to further improve the reasoning accuracy of the performed actions, a video graph smoothing mechanism is designed with a fully-connected Conditional Random Field (CRF). By sequentially examining every state transition (including attributes and relationships) of the smoothed video graph, the occurring actions can be recognized with pre-defined rules. Experiments on the CAD-120++ dataset and a newly collected NTU RGBD++ dataset verify that the proposed OSVG outperforms other state-of-the-art video action reasoning strategies in both state recognition and action recognition accuracy.
KW - Explainable action reasoning
KW - Spatial-temporal scene graphs
KW - State transition
UR - http://www.scopus.com/inward/record.url?scp=85125959857&partnerID=8YFLogxK
U2 - 10.1016/j.neucom.2022.02.069
DO - 10.1016/j.neucom.2022.02.069
M3 - Article
AN - SCOPUS:85125959857
SN - 0925-2312
VL - 488
SP - 212
EP - 225
JO - Neurocomputing
JF - Neurocomputing
ER -