Cascade Multi-Level Transformer Network for Surgical Workflow Analysis

Wenxi Yue; Hongen Liao; Yong Xia; Vincent Lam; Jiebo Luo; Zhiyong Wang

doi:10.1109/TMI.2023.3265354

School of Computer Science

Research output: Contribution to journal › Article › peer-review

7 Scopus citations

Abstract

Surgical workflow analysis aims to recognise surgical phases from untrimmed surgical videos. It is an integral component for enabling context-aware computer-aided surgical operating systems. Many deep learning-based methods have been developed for this task. However, most existing works aggregate homogeneous temporal context for all frames at a single level and neglect the fact that each frame has its specific need for information at multiple levels for accurate phase prediction. To fill this gap, in this paper we propose Cascade Multi-Level Transformer Network (CMTNet) composed of cascaded Adaptive Multi-Level Context Aggregation (AMCA) modules. Each AMCA module first extracts temporal context at the frame level and the phase level and then fuses frame-specific spatial feature, frame-level temporal context, and phase-level temporal context for each frame adaptively. By cascading multiple AMCA modules, CMTNet is able to gradually enrich the representation of each frame with the multi-level semantics that it specifically requires, achieving better phase prediction in a frame-adaptive manner. In addition, we propose a novel refinement loss for CMTNet, which explicitly guides each AMCA module to focus on extracting the key context for refining the prediction of the previous stage in terms of both prediction confidence and smoothness. This further enhances the quality of the extracted context effectively. Extensive experiments on the Cholec80 and the M2CAI datasets demonstrate that CMTNet achieves state-of-the-art performance.
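The adaptive fusion step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how the fusion is computed, so the code below assumes a simple per-frame softmax gate over the three feature streams. The function name `adaptive_fuse`, the gating matrix `w_gate`, and all shapes are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def adaptive_fuse(spatial, frame_ctx, phase_ctx, w_gate):
    """Fuse three per-frame feature streams with per-frame weights.

    Assumption (not from the paper): each frame gets its own softmax
    weights over the three streams, computed from a linear gate.

    spatial, frame_ctx, phase_ctx: arrays of shape (T, D), one row per frame.
    w_gate: hypothetical gating matrix of shape (3 * D, 3).
    Returns (fused, weights) with shapes (T, D) and (T, 3).
    """
    # Stack the streams so each frame sees its three candidate features.
    stacked = np.stack([spatial, frame_ctx, phase_ctx], axis=1)  # (T, 3, D)
    # One gate logit per stream, computed from the concatenated features.
    concat = np.concatenate([spatial, frame_ctx, phase_ctx], axis=1)  # (T, 3D)
    weights = softmax(concat @ w_gate, axis=-1)  # (T, 3)
    # Weighted sum over the stream axis gives one fused vector per frame.
    fused = (weights[:, :, None] * stacked).sum(axis=1)  # (T, D)
    return fused, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, D = 5, 4  # 5 frames, 4-dimensional features (toy sizes)
    spatial, frame_ctx, phase_ctx = rng.normal(size=(3, T, D))
    w_gate = rng.normal(size=(3 * D, 3))
    fused, weights = adaptive_fuse(spatial, frame_ctx, phase_ctx, w_gate)
    print(fused.shape, weights.sum(axis=1))
```

Because the weights are computed per frame, different frames can lean on different streams, which is one plausible reading of "frame-adaptive". A cascade in this spirit would feed each stage's fused output into the next stage as its input representation.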

Original language	English
Pages (from-to)	2817-2831
Number of pages	15
Journal	IEEE Transactions on Medical Imaging
Volume	42
Issue number	10
DOIs	https://doi.org/10.1109/TMI.2023.3265354
State	Published - 1 Oct 2023

Keywords

Surgical phase recognition
surgical workflow analysis
temporal context aggregation
transformer network

Cite this

@article{6d3d8e1d189a4d5aba0979321ee9f061,

title = "Cascade Multi-Level Transformer Network for Surgical Workflow Analysis",

abstract = "Surgical workflow analysis aims to recognise surgical phases from untrimmed surgical videos. It is an integral component for enabling context-aware computer-aided surgical operating systems. Many deep learning-based methods have been developed for this task. However, most existing works aggregate homogeneous temporal context for all frames at a single level and neglect the fact that each frame has its specific need for information at multiple levels for accurate phase prediction. To fill this gap, in this paper we propose Cascade Multi-Level Transformer Network (CMTNet) composed of cascaded Adaptive Multi-Level Context Aggregation (AMCA) modules. Each AMCA module first extracts temporal context at the frame level and the phase level and then fuses frame-specific spatial feature, frame-level temporal context, and phase-level temporal context for each frame adaptively. By cascading multiple AMCA modules, CMTNet is able to gradually enrich the representation of each frame with the multi-level semantics that it specifically requires, achieving better phase prediction in a frame-adaptive manner. In addition, we propose a novel refinement loss for CMTNet, which explicitly guides each AMCA module to focus on extracting the key context for refining the prediction of the previous stage in terms of both prediction confidence and smoothness. This further enhances the quality of the extracted context effectively. Extensive experiments on the Cholec80 and the M2CAI datasets demonstrate that CMTNet achieves state-of-the-art performance.",

keywords = "Surgical phase recognition, surgical workflow analysis, temporal context aggregation, transformer network",

author = "Wenxi Yue and Hongen Liao and Yong Xia and Vincent Lam and Jiebo Luo and Zhiyong Wang",

note = "Publisher Copyright: {\textcopyright} 1982-2012 IEEE.",

year = "2023",

month = oct,

day = "1",

doi = "10.1109/TMI.2023.3265354",

language = "English",

volume = "42",

pages = "2817--2831",

journal = "IEEE Transactions on Medical Imaging",

issn = "0278-0062",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

number = "10",

}

TY - JOUR

T1 - Cascade Multi-Level Transformer Network for Surgical Workflow Analysis

AU - Yue, Wenxi

AU - Liao, Hongen

AU - Xia, Yong

AU - Lam, Vincent

AU - Luo, Jiebo

AU - Wang, Zhiyong

PY - 2023/10/1

Y1 - 2023/10/1

N2 - Surgical workflow analysis aims to recognise surgical phases from untrimmed surgical videos. It is an integral component for enabling context-aware computer-aided surgical operating systems. Many deep learning-based methods have been developed for this task. However, most existing works aggregate homogeneous temporal context for all frames at a single level and neglect the fact that each frame has its specific need for information at multiple levels for accurate phase prediction. To fill this gap, in this paper we propose Cascade Multi-Level Transformer Network (CMTNet) composed of cascaded Adaptive Multi-Level Context Aggregation (AMCA) modules. Each AMCA module first extracts temporal context at the frame level and the phase level and then fuses frame-specific spatial feature, frame-level temporal context, and phase-level temporal context for each frame adaptively. By cascading multiple AMCA modules, CMTNet is able to gradually enrich the representation of each frame with the multi-level semantics that it specifically requires, achieving better phase prediction in a frame-adaptive manner. In addition, we propose a novel refinement loss for CMTNet, which explicitly guides each AMCA module to focus on extracting the key context for refining the prediction of the previous stage in terms of both prediction confidence and smoothness. This further enhances the quality of the extracted context effectively. Extensive experiments on the Cholec80 and the M2CAI datasets demonstrate that CMTNet achieves state-of-the-art performance.

KW - Surgical phase recognition

KW - surgical workflow analysis

KW - temporal context aggregation

KW - transformer network

UR - http://www.scopus.com/inward/record.url?scp=85153365674&partnerID=8YFLogxK

U2 - 10.1109/TMI.2023.3265354

DO - 10.1109/TMI.2023.3265354

M3 - Article

C2 - 37037257

AN - SCOPUS:85153365674

SN - 0278-0062

VL - 42

SP - 2817

EP - 2831

JO - IEEE Transactions on Medical Imaging

JF - IEEE Transactions on Medical Imaging

IS - 10

ER -
