STMT: Spatio-temporal memory transformer for multi-object tracking

Songbo Gu; Jianxin Ma; Guancheng Hui; Qiyang Xiao; Wentao Shi

doi:10.1007/s10489-023-04617-1

Ocean Institute

Henan University

Research output: Contribution to journal › Article › peer-review

9 Scopus citations

Abstract

Typically, modern online Multi-Object Tracking (MOT) methods first obtain the detected objects in each frame and then establish associations between them in successive frames. However, it is difficult to obtain high-quality trajectories when camera motion, fast motion, and occlusion challenges occur. To address these problems, this paper proposes a transformer-based MOT system named Spatio-Temporal Memory Transformer (STMT), which focuses on time and history information. The proposed STMT consists of a Spatio-Temporal Enhancement Module (STEM) that uses 3D convolution to model the spatial and temporal interactions of objects and obtains rich features in spatio-temporal information. Moreover, a Dynamic Spatio-Temporal Memory (DSTM) is presented to associate detections with tracklets and contains three units: an Identity Aggregation Module (IAM), a Linear Dynamic Encoder (LD-Encoder) and a memory Decoder (Decoder). The IAM utilizes the geometric changes of objects to reduce the impact of deformation on tracking performance, the LD-Encoder is used to obtain the dependency between objects, and the Decoder generates appearance similarity scores. Furthermore, a Score Fusion Equilibrium Strategy (SFES) is employed to balance the similarity and position distance fusion scores. Extensive experiments demonstrate that the proposed STMT approach is generally superior to the state-of-the-art trackers on the MOT16 and MOT17 benchmarks.

Original language	English
Pages (from-to)	23426-23441
Number of pages	16
Journal	Applied Intelligence
Volume	53
Issue number	20
DOIs	https://doi.org/10.1007/s10489-023-04617-1
State	Published - Oct 2023

Keywords

Deep learning
Memory
Multi-object tracking
Spatio-temporal
Transformer

Cite this

@article{dc68099d0b8248f7b0520e9e9cfc14f9,

title = "STMT: Spatio-temporal memory transformer for multi-object tracking",

abstract = "Typically, modern online Multi-Object Tracking (MOT) methods first obtain the detected objects in each frame and then establish associations between them in successive frames. However, it is difficult to obtain high-quality trajectories when camera motion, fast motion, and occlusion challenges occur. To address these problems, this paper proposes a transformer-based MOT system named Spatio-Temporal Memory Transformer (STMT), which focuses on time and history information. The proposed STMT consists of a Spatio-Temporal Enhancement Module (STEM) that uses 3D convolution to model the spatial and temporal interactions of objects and obtains rich features in spatio-temporal information. Moreover, a Dynamic Spatio-Temporal Memory (DSTM) is presented to associate detections with tracklets and contains three units: an Identity Aggregation Module (IAM), a Linear Dynamic Encoder (LD-Encoder) and a memory Decoder (Decoder). The IAM utilizes the geometric changes of objects to reduce the impact of deformation on tracking performance, the LD-Encoder is used to obtain the dependency between objects, and the Decoder generates appearance similarity scores. Furthermore, a Score Fusion Equilibrium Strategy (SFES) is employed to balance the similarity and position distance fusion scores. Extensive experiments demonstrate that the proposed STMT approach is generally superior to the state-of-the-art trackers on the MOT16 and MOT17 benchmarks.",

keywords = "Deep learning, Memory, Multi-object tracking, Spatio-temporal, Transformer",

author = "Songbo Gu and Jianxin Ma and Guancheng Hui and Qiyang Xiao and Wentao Shi",

note = "Publisher Copyright: {\textcopyright} 2023, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.",

year = "2023",

month = oct,

doi = "10.1007/s10489-023-04617-1",

language = "English",

volume = "53",

pages = "23426--23441",

journal = "Applied Intelligence",

issn = "0924-669X",

publisher = "Springer Netherlands",

number = "20",

}

TY - JOUR

T1 - STMT

T2 - Spatio-temporal memory transformer for multi-object tracking

AU - Gu, Songbo

AU - Ma, Jianxin

AU - Hui, Guancheng

AU - Xiao, Qiyang

AU - Shi, Wentao

PY - 2023/10

Y1 - 2023/10

AB - Typically, modern online Multi-Object Tracking (MOT) methods first obtain the detected objects in each frame and then establish associations between them in successive frames. However, it is difficult to obtain high-quality trajectories when camera motion, fast motion, and occlusion challenges occur. To address these problems, this paper proposes a transformer-based MOT system named Spatio-Temporal Memory Transformer (STMT), which focuses on time and history information. The proposed STMT consists of a Spatio-Temporal Enhancement Module (STEM) that uses 3D convolution to model the spatial and temporal interactions of objects and obtains rich features in spatio-temporal information. Moreover, a Dynamic Spatio-Temporal Memory (DSTM) is presented to associate detections with tracklets and contains three units: an Identity Aggregation Module (IAM), a Linear Dynamic Encoder (LD-Encoder) and a memory Decoder (Decoder). The IAM utilizes the geometric changes of objects to reduce the impact of deformation on tracking performance, the LD-Encoder is used to obtain the dependency between objects, and the Decoder generates appearance similarity scores. Furthermore, a Score Fusion Equilibrium Strategy (SFES) is employed to balance the similarity and position distance fusion scores. Extensive experiments demonstrate that the proposed STMT approach is generally superior to the state-of-the-art trackers on the MOT16 and MOT17 benchmarks.

KW - Deep learning

KW - Memory

KW - Multi-object tracking

KW - Spatio-temporal

KW - Transformer

UR - http://www.scopus.com/inward/record.url?scp=85164133920&partnerID=8YFLogxK

U2 - 10.1007/s10489-023-04617-1

DO - 10.1007/s10489-023-04617-1

M3 - Article

AN - SCOPUS:85164133920

SN - 0924-669X

VL - 53

SP - 23426

EP - 23441

JO - Applied Intelligence

JF - Applied Intelligence

IS - 20

ER -
