TY - JOUR
T1 - Click-level supervision for online action detection extended from SCOAD
AU - Zhang, Xing
AU - Mei, Yuhan
AU - Na, Ye
AU - Lin, Xia Ling
AU - Bian, Genqing
AU - Yan, Qingsen
AU - Mohi-ud-din, Ghulam
AU - Ai, Chen
AU - Li, Zhou
AU - Dong, Wei
N1 - Publisher Copyright:
© 2024 Elsevier B.V.
PY - 2025/5
Y1 - 2025/5
N2 - Data-driven fully-supervised online action detection algorithms heavily rely on manual annotations, which are challenging to obtain in real-world applications. Current research efforts aim to address this issue by introducing weakly supervised online action detection (WOAD) methods that utilize video-level annotations. However, these approaches frequently face challenges with blurred temporal boundaries, stemming from the lack of explicit temporal information. In this work, we revisit WOAD and propose an algorithm for weakly supervised online action detection using click-level annotations, which we call Single-frame Click Supervision for Online Action Detection (SCOAD). SCOAD stands out by significantly improving prediction accuracy without substantially increasing the annotation cost. This improvement is achieved through a set of well-engineered loss functions that leverage the limited temporal information provided by click labels. Additionally, we present an enhanced version of SCOAD called SCOAD++, which introduces a novel mechanism that strengthens the model's ability to utilize historical information and refines detail differentiation, addressing the limitations of traditional fully connected frameworks that neglect temporal variations. Furthermore, to explore the accuracy variation caused by the inherent randomness of click-level annotation, we construct a human fitness video dataset for this study. Using this dataset, we also reveal the limitations of video-level labels in action detection. We perform extensive experiments on numerous benchmark datasets and demonstrate that our approach outperforms state-of-the-art methods.
AB - Data-driven fully-supervised online action detection algorithms heavily rely on manual annotations, which are challenging to obtain in real-world applications. Current research efforts aim to address this issue by introducing weakly supervised online action detection (WOAD) methods that utilize video-level annotations. However, these approaches frequently face challenges with blurred temporal boundaries, stemming from the lack of explicit temporal information. In this work, we revisit WOAD and propose an algorithm for weakly supervised online action detection using click-level annotations, which we call Single-frame Click Supervision for Online Action Detection (SCOAD). SCOAD stands out by significantly improving prediction accuracy without substantially increasing the annotation cost. This improvement is achieved through a set of well-engineered loss functions that leverage the limited temporal information provided by click labels. Additionally, we present an enhanced version of SCOAD called SCOAD++, which introduces a novel mechanism that strengthens the model's ability to utilize historical information and refines detail differentiation, addressing the limitations of traditional fully connected frameworks that neglect temporal variations. Furthermore, to explore the accuracy variation caused by the inherent randomness of click-level annotation, we construct a human fitness video dataset for this study. Using this dataset, we also reveal the limitations of video-level labels in action detection. We perform extensive experiments on numerous benchmark datasets and demonstrate that our approach outperforms state-of-the-art methods.
KW - Computer vision
KW - Online action detection
KW - Video understanding
KW - Weakly supervised learning
UR - http://www.scopus.com/inward/record.url?scp=85213212015&partnerID=8YFLogxK
U2 - 10.1016/j.future.2024.107668
DO - 10.1016/j.future.2024.107668
M3 - Article
AN - SCOPUS:85213212015
SN - 0167-739X
VL - 166
JO - Future Generation Computer Systems
JF - Future Generation Computer Systems
M1 - 107668
ER -