TY - JOUR
T1 - Learning Video Salient Object Detection Progressively from Unlabeled Videos
AU - Xu, Binwei
AU - Jiang, Qiuping
AU - Liang, Haoran
AU - Zhang, Dingwen
AU - Liang, Ronghua
AU - Chen, Peng
N1 - Publisher Copyright:
© 1999-2012 IEEE.
PY - 2024
Y1 - 2024
N2 - Recently, deep learning-based video salient object detection (VSOD) has achieved some breakthroughs, but these methods rely on expensive videos with pixel-wise or weak annotations. In this paper, based on the similarities and differences between VSOD and image salient object detection (SOD), we propose a novel VSOD method via a progressive framework that locates and segments salient objects in sequence without utilizing any video annotation. To efficiently use the knowledge learned on the SOD dataset for VSOD, we introduce dynamic saliency to compensate for SOD's lack of motion information during the locating process while keeping the fine segmenting process unchanged. Specifically, we utilize the coarse locating model trained on the image dataset to identify frames exhibiting both static and dynamic saliency. The locating results of these frames are selected as spatiotemporal location labels. Moreover, the number of spatiotemporal location labels is increased by tracking salient objects across adjacent frames. On the basis of these location labels, a two-stream locating network with an optical flow branch is proposed to capture salient objects in videos. Results on five public benchmarks demonstrate that our method outperforms state-of-the-art weakly supervised and unsupervised methods.
AB - Recently, deep learning-based video salient object detection (VSOD) has achieved some breakthroughs, but these methods rely on expensive videos with pixel-wise or weak annotations. In this paper, based on the similarities and differences between VSOD and image salient object detection (SOD), we propose a novel VSOD method via a progressive framework that locates and segments salient objects in sequence without utilizing any video annotation. To efficiently use the knowledge learned on the SOD dataset for VSOD, we introduce dynamic saliency to compensate for SOD's lack of motion information during the locating process while keeping the fine segmenting process unchanged. Specifically, we utilize the coarse locating model trained on the image dataset to identify frames exhibiting both static and dynamic saliency. The locating results of these frames are selected as spatiotemporal location labels. Moreover, the number of spatiotemporal location labels is increased by tracking salient objects across adjacent frames. On the basis of these location labels, a two-stream locating network with an optical flow branch is proposed to capture salient objects in videos. Results on five public benchmarks demonstrate that our method outperforms state-of-the-art weakly supervised and unsupervised methods.
KW - Location
KW - Optical flow
KW - Segmentation
KW - Video salient object detection
KW - Weakly supervised learning
UR - http://www.scopus.com/inward/record.url?scp=85213715821&partnerID=8YFLogxK
U2 - 10.1109/TMM.2024.3521783
DO - 10.1109/TMM.2024.3521783
M3 - Article
AN - SCOPUS:85213715821
SN - 1520-9210
JO - IEEE Transactions on Multimedia
JF - IEEE Transactions on Multimedia
ER -