TY - JOUR
T1 - Semantic-Guided Diffusion for Robust Multi-Object Tracking with Temporal Enhancement
AU - Li, Yuhao
AU - Zheng, Haotian
AU - Sun, Jinqiu
AU - Zhu, Yu
AU - Zhang, Yanning
N1 - Publisher Copyright: © 1991-2012 IEEE.
PY - 2026
Y1 - 2026
AB - Diffusion-based motion prediction methods have demonstrated strong capabilities in modeling nonlinear motion for multi-object tracking (MOT). However, in complex scenarios involving target interactions or occlusions, these methods still suffer from frequent identity switches and inaccurate trajectory predictions. This is primarily due to insufficient joint modeling of appearance and motion cues, as well as limited adaptability to diverse motion patterns. To address these challenges, we propose a semantic-guided diffusion-based method, termed SGDMOT, which jointly models target motion dynamics and identity consistency. Specifically, we leverage historical trajectories to query image-level global features, incorporating appearance and contextual information. These are fused with motion information via an attention mechanism, guiding the diffusion process to generate semantically consistent trajectories. Furthermore, we introduce a learnable multi-scale temporal modulation module that dynamically adjusts the encoding of diffusion time steps based on historical motion states. This enhances the model’s ability to adapt to motion variations across different temporal granularities, improving temporal modeling during the generation phase. Extensive experiments on the DanceTrack, MOT17, and MOT20 benchmarks demonstrate the effectiveness of our approach. Notably, on the DanceTrack test set, SGDMOT achieves an absolute gain of 2.3% in Higher Order Tracking Accuracy (HOTA) compared to a baseline diffusion model relying solely on motion features. Our code and pretrained models will be publicly released.
KW - Diffusion-based tracking
KW - multi-object tracking
KW - semantic-guided motion prediction
UR - https://www.scopus.com/pages/publications/105038714546
U2 - 10.1109/TCSVT.2026.3691397
DO - 10.1109/TCSVT.2026.3691397
M3 - Article
AN - SCOPUS:105038714546
SN - 1051-8215
JO - IEEE Transactions on Circuits and Systems for Video Technology
JF - IEEE Transactions on Circuits and Systems for Video Technology
ER -