Exploring Rich and Efficient Spatial Temporal Interactions for Real-Time Video Salient Object Detection

Chenglizhao Chen; Guotao Wang; Chong Peng; Yuming Fang; Dingwen Zhang; Hong Qin

doi:10.1109/TIP.2021.3068644

Research output: Contribution to journal › Article › peer-review

115 Scopus citations

Abstract

We have witnessed a growing interest in video salient object detection (VSOD) techniques in today's computer vision applications. In contrast with temporal information (which is still considered a rather unstable source thus far), spatial information is more stable and ubiquitous, and thus may influence our vision system more. As a result, the current mainstream VSOD approaches have inferred and obtained their saliency primarily from the spatial perspective, still treating temporal information as subordinate. Although the aforementioned methodology of focusing on the spatial aspect is effective in achieving a numeric performance gain, it still has two critical limitations. First, to ensure the dominance of spatial information, its temporal counterpart remains inadequately used, though in some complex video scenes, the temporal information may represent the only reliable data source, which is critical for deriving correct VSOD results. Second, both spatial and temporal saliency cues are often computed independently in advance and then integrated later on, while the interactions between them are omitted completely, resulting in saliency cues with limited quality. To combat these challenges, this paper advocates a novel spatiotemporal network, where the key innovation is the design of its temporal unit. Compared with other existing competitors (e.g., ConvLSTM), the proposed temporal unit exhibits an extremely lightweight design that does not degrade its strong ability to sense temporal information. Furthermore, it fully enables the computation of temporal saliency cues that interact with their spatial counterparts, ultimately boosting the overall VSOD performance and realizing its full potential towards mutual performance improvement for each. The proposed method is easy to implement yet still effective, achieving high-quality VSOD at 50 FPS in real-time applications.

Original language	English
Article number	9390381
Pages (from-to)	3995-4007
Number of pages	13
Journal	IEEE Transactions on Image Processing
Volume	30
DOIs	https://doi.org/10.1109/TIP.2021.3068644
State	Published - 2021
Externally published	Yes

Keywords

Video salient object detection
fast temporal shuffle
lightweight temporal unit
multiscale spatiotemporal deep features

Cite this

@article{5c219ee2bd6142929b0571b1a16d808a,
  title = "Exploring Rich and Efficient Spatial Temporal Interactions for Real-Time Video Salient Object Detection",
  abstract = "We have witnessed a growing interest in video salient object detection (VSOD) techniques in today's computer vision applications. In contrast with temporal information (which is still considered a rather unstable source thus far), spatial information is more stable and ubiquitous, and thus may influence our vision system more. As a result, the current mainstream VSOD approaches have inferred and obtained their saliency primarily from the spatial perspective, still treating temporal information as subordinate. Although the aforementioned methodology of focusing on the spatial aspect is effective in achieving a numeric performance gain, it still has two critical limitations. First, to ensure the dominance of spatial information, its temporal counterpart remains inadequately used, though in some complex video scenes, the temporal information may represent the only reliable data source, which is critical for deriving correct VSOD results. Second, both spatial and temporal saliency cues are often computed independently in advance and then integrated later on, while the interactions between them are omitted completely, resulting in saliency cues with limited quality. To combat these challenges, this paper advocates a novel spatiotemporal network, where the key innovation is the design of its temporal unit. Compared with other existing competitors (e.g., ConvLSTM), the proposed temporal unit exhibits an extremely lightweight design that does not degrade its strong ability to sense temporal information. Furthermore, it fully enables the computation of temporal saliency cues that interact with their spatial counterparts, ultimately boosting the overall VSOD performance and realizing its full potential towards mutual performance improvement for each. The proposed method is easy to implement yet still effective, achieving high-quality VSOD at 50 FPS in real-time applications.",
  keywords = "Video salient object detection, fast temporal shuffle, lightweight temporal unit, multiscale spatiotemporal deep features",
  author = "Chenglizhao Chen and Guotao Wang and Chong Peng and Yuming Fang and Dingwen Zhang and Hong Qin",
  note = "Publisher Copyright: {\textcopyright} 1992-2012 IEEE.",
  year = "2021",
  doi = "10.1109/TIP.2021.3068644",
  language = "English",
  volume = "30",
  pages = "3995--4007",
  journal = "IEEE Transactions on Image Processing",
  issn = "1057-7149",
  publisher = "Institute of Electrical and Electronics Engineers Inc.",
}

TY - JOUR
T1 - Exploring Rich and Efficient Spatial Temporal Interactions for Real-Time Video Salient Object Detection
AU - Chen, Chenglizhao
AU - Wang, Guotao
AU - Peng, Chong
AU - Fang, Yuming
AU - Zhang, Dingwen
AU - Qin, Hong
PY - 2021
Y1 - 2021
N2 - We have witnessed a growing interest in video salient object detection (VSOD) techniques in today's computer vision applications. In contrast with temporal information (which is still considered a rather unstable source thus far), spatial information is more stable and ubiquitous, and thus may influence our vision system more. As a result, the current mainstream VSOD approaches have inferred and obtained their saliency primarily from the spatial perspective, still treating temporal information as subordinate. Although the aforementioned methodology of focusing on the spatial aspect is effective in achieving a numeric performance gain, it still has two critical limitations. First, to ensure the dominance of spatial information, its temporal counterpart remains inadequately used, though in some complex video scenes, the temporal information may represent the only reliable data source, which is critical for deriving correct VSOD results. Second, both spatial and temporal saliency cues are often computed independently in advance and then integrated later on, while the interactions between them are omitted completely, resulting in saliency cues with limited quality. To combat these challenges, this paper advocates a novel spatiotemporal network, where the key innovation is the design of its temporal unit. Compared with other existing competitors (e.g., ConvLSTM), the proposed temporal unit exhibits an extremely lightweight design that does not degrade its strong ability to sense temporal information. Furthermore, it fully enables the computation of temporal saliency cues that interact with their spatial counterparts, ultimately boosting the overall VSOD performance and realizing its full potential towards mutual performance improvement for each. The proposed method is easy to implement yet still effective, achieving high-quality VSOD at 50 FPS in real-time applications.
KW - Video salient object detection
KW - fast temporal shuffle
KW - lightweight temporal unit
KW - multiscale spatiotemporal deep features
UR - http://www.scopus.com/inward/record.url?scp=85103758402&partnerID=8YFLogxK
U2 - 10.1109/TIP.2021.3068644
DO - 10.1109/TIP.2021.3068644
M3 - Article
C2 - 33784620
AN - SCOPUS:85103758402
SN - 1057-7149
VL - 30
SP - 3995
EP - 4007
JO - IEEE Transactions on Image Processing
JF - IEEE Transactions on Image Processing
M1 - 9390381
ER -
