TY - JOUR
T1 - Cascade Multi-Level Transformer Network for Surgical Workflow Analysis
AU - Yue, Wenxi
AU - Liao, Hongen
AU - Xia, Yong
AU - Lam, Vincent
AU - Luo, Jiebo
AU - Wang, Zhiyong
N1 - Publisher Copyright:
© 1982-2012 IEEE.
PY - 2023/10/1
Y1 - 2023/10/1
N2 - Surgical workflow analysis aims to recognise surgical phases from untrimmed surgical videos. It is an integral component for enabling context-aware computer-aided surgical operating systems. Many deep learning-based methods have been developed for this task. However, most existing works aggregate homogeneous temporal context for all frames at a single level and neglect the fact that each frame has its specific need for information at multiple levels for accurate phase prediction. To fill this gap, in this paper we propose Cascade Multi-Level Transformer Network (CMTNet) composed of cascaded Adaptive Multi-Level Context Aggregation (AMCA) modules. Each AMCA module first extracts temporal context at the frame level and the phase level and then fuses frame-specific spatial feature, frame-level temporal context, and phase-level temporal context for each frame adaptively. By cascading multiple AMCA modules, CMTNet is able to gradually enrich the representation of each frame with the multi-level semantics that it specifically requires, achieving better phase prediction in a frame-adaptive manner. In addition, we propose a novel refinement loss for CMTNet, which explicitly guides each AMCA module to focus on extracting the key context for refining the prediction of the previous stage in terms of both prediction confidence and smoothness. This further enhances the quality of the extracted context effectively. Extensive experiments on the Cholec80 and the M2CAI datasets demonstrate that CMTNet achieves state-of-the-art performance.
AB - Surgical workflow analysis aims to recognise surgical phases from untrimmed surgical videos. It is an integral component for enabling context-aware computer-aided surgical operating systems. Many deep learning-based methods have been developed for this task. However, most existing works aggregate homogeneous temporal context for all frames at a single level and neglect the fact that each frame has its specific need for information at multiple levels for accurate phase prediction. To fill this gap, in this paper we propose Cascade Multi-Level Transformer Network (CMTNet) composed of cascaded Adaptive Multi-Level Context Aggregation (AMCA) modules. Each AMCA module first extracts temporal context at the frame level and the phase level and then fuses frame-specific spatial feature, frame-level temporal context, and phase-level temporal context for each frame adaptively. By cascading multiple AMCA modules, CMTNet is able to gradually enrich the representation of each frame with the multi-level semantics that it specifically requires, achieving better phase prediction in a frame-adaptive manner. In addition, we propose a novel refinement loss for CMTNet, which explicitly guides each AMCA module to focus on extracting the key context for refining the prediction of the previous stage in terms of both prediction confidence and smoothness. This further enhances the quality of the extracted context effectively. Extensive experiments on the Cholec80 and the M2CAI datasets demonstrate that CMTNet achieves state-of-the-art performance.
KW - Surgical phase recognition
KW - surgical workflow analysis
KW - temporal context aggregation
KW - transformer network
UR - http://www.scopus.com/inward/record.url?scp=85153365674&partnerID=8YFLogxK
U2 - 10.1109/TMI.2023.3265354
DO - 10.1109/TMI.2023.3265354
M3 - 文章
C2 - 37037257
AN - SCOPUS:85153365674
SN - 0278-0062
VL - 42
SP - 2817
EP - 2831
JO - IEEE Transactions on Medical Imaging
JF - IEEE Transactions on Medical Imaging
IS - 10
ER -