Segmentation in Weakly Labeled Videos via a Semantic Ranking and Optical Warping Network

Le Yang; Junwei Han; Dingwen Zhang; Nian Liu; Dong Zhang

doi:10.1109/TIP.2018.2834221

School of Automation

Northwestern Polytechnical University Xian

Research output: Contribution to journal › Article › peer-review

20 Scopus citations

Abstract

Weakly supervised video object segmentation (WSVOS) focuses on generating pixel-level object masks for videos tagged only with class labels, an essential yet challenging task. In WSVOS, the algorithm is aware only of rough category information rather than concrete object size and location cues; moreover, it lacks reliable annotated exemplars from which to learn temporal evolution in the investigated videos. Three challenging factors may influence the performance of WSVOS: foreground object discovery in each frame, coarse object semantic consistency within each video, and fine-grained segmentation smoothness within neighboring frames. In this paper, we establish a semantic ranking and optical warping network to simultaneously solve these three challenges in a unified framework. For the first challenge, we apply the still image saliency detection method and discover the foreground object for each frame via a segmentation network. Due to the huge discrepancies between image saliency and video object segmentation, we go a step further and propose two subnetworks to solve the other two challenges. For the second one, we propose an attentive semantic ranking subnetwork to mine video-level tags, which can learn discriminative features for semantic ranking and lead to semantically consistent segmentation masks. For the third one, we propose an optical flow warping subnetwork to constrain fine-grained segmentation smoothness within neighboring frames, which can suppress large deformations and thus obtain smooth object boundaries for adjacent frames. Experiments on two benchmark data sets, i.e., the DAVIS data set and the YouTube-Objects data set, demonstrate the effectiveness of the proposed approach for segmenting out video objects under weak supervision.
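
The third challenge, segmentation smoothness across neighboring frames, can be illustrated with a small sketch of flow-based mask warping. This is not the authors' subnetwork: the function names `warp_mask` and `smoothness_penalty`, the nearest-neighbor sampling, and the mean absolute difference penalty are all assumptions chosen for illustration. It only shows the general idea of warping one frame's mask along an optical flow field and penalizing disagreement with the next frame's mask.

```python
def warp_mask(mask, flow):
    """Backward-warp a mask with a dense flow field (nearest-neighbor sampling).

    mask: H x W list of floats, soft foreground scores for frame t.
    flow: H x W list of (dx, dy) pairs. For pixel (x, y) of frame t+1,
          (x + dx, y + dy) is assumed to be its matching position in frame t.
    """
    height, width = len(mask), len(mask[0])
    warped = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dx, dy = flow[y][x]
            # Clamp the sampling position to the image border.
            src_x = min(max(int(round(x + dx)), 0), width - 1)
            src_y = min(max(int(round(y + dy)), 0), height - 1)
            warped[y][x] = mask[src_y][src_x]
    return warped


def smoothness_penalty(warped_prev, mask_next):
    """Mean absolute difference between the warped mask and the next mask."""
    total = 0.0
    count = 0
    for row_a, row_b in zip(warped_prev, mask_next):
        for a, b in zip(row_a, row_b):
            total += abs(a - b)
            count += 1
    return total / count


# Toy example: a one-pixel-wide object moves one pixel to the right.
mask_t = [[0.0, 1.0, 0.0],
          [0.0, 1.0, 0.0]]
mask_t1 = [[0.0, 0.0, 1.0],
           [0.0, 0.0, 1.0]]

# Flow that matches the motion: each pixel of frame t+1 looks one pixel left.
true_flow = [[(-1.0, 0.0)] * 3 for _ in range(2)]
# Flow that ignores the motion entirely.
zero_flow = [[(0.0, 0.0)] * 3 for _ in range(2)]

penalty_true = smoothness_penalty(warp_mask(mask_t, true_flow), mask_t1)
penalty_zero = smoothness_penalty(warp_mask(mask_t, zero_flow), mask_t1)
```

In this toy case the penalty vanishes when the flow matches the motion and is positive otherwise. A trainable version would typically use differentiable bilinear sampling instead of nearest-neighbor lookup.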

Original language	English
Pages (from-to)	4025-4037
Number of pages	13
Journal	IEEE Transactions on Image Processing
Volume	27
Issue number	8
DOIs	https://doi.org/10.1109/TIP.2018.2834221
State	Published - Aug 2018

Keywords

optical warping
semantic ranking
Video object segmentation
weak supervision

Cite this

@article{f6facb9276074a6d97942c7d83978a14,

title = "Segmentation in Weakly Labeled Videos via a Semantic Ranking and Optical Warping Network",

abstract = "Weakly supervised video object segmentation (WSVOS) focuses on generating pixel-level object masks for videos only tagged with class labels, which is an essential yet challenging task. For WSVOS, the algorithm is just aware of rough category information rather than the concrete object size and location cues, besides it lacks reliable annotated exemplars to learn temporal evolution in the investigated videos. Basically, there are three challenging factors which may influence the performance of WSVOS: foreground object discovery in each frame, coarse object semantic consistency within each video, and fine-grained segmentation smoothness within neighbor frames. In this paper, we establish a semantic ranking and optical warping network to simultaneously solve these three challenges in a unified framework. For the first challenge, we apply the still image saliency detection method and discover the foreground object for each frame via a segmentation network. Due to the huge discrepancies between the image saliency and the video object segmentation, we step further and propose two subnetworks to solve the other two challenges. For the second one, we propose an attentive semantic ranking subnetwork to mine video-level tags, which can learn discriminative features for semantic ranking and lead to semantic consistent segmentation masks. For the third one, we propose an optical flow warping subnetwork to constrain fine-grained segmentation smoothness within neighbor frames, which can suppress the large deformation and thus obtain smooth object boundaries for adjacent frames. Experiments on two benchmark data sets, i.e., DAVIS data set and YouTube-Objects data set, demonstrate the effectiveness of the proposed approach for segmenting out video objects under weak supervision.",

keywords = "optical warping, semantic ranking, Video object segmentation, weak supervision",

author = "Le Yang and Junwei Han and Dingwen Zhang and Nian Liu and Dong Zhang",

note = "Publisher Copyright: {\textcopyright} 1992-2012 IEEE.",

year = "2018",

month = aug,

doi = "10.1109/TIP.2018.2834221",

language = "English",

volume = "27",

pages = "4025--4037",

journal = "IEEE Transactions on Image Processing",

issn = "1057-7149",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

number = "8",

}

TY - JOUR

T1 - Segmentation in Weakly Labeled Videos via a Semantic Ranking and Optical Warping Network

AU - Yang, Le

AU - Han, Junwei

AU - Zhang, Dingwen

AU - Liu, Nian

AU - Zhang, Dong

PY - 2018/8

Y1 - 2018/8

N2 - Weakly supervised video object segmentation (WSVOS) focuses on generating pixel-level object masks for videos only tagged with class labels, which is an essential yet challenging task. For WSVOS, the algorithm is just aware of rough category information rather than the concrete object size and location cues, besides it lacks reliable annotated exemplars to learn temporal evolution in the investigated videos. Basically, there are three challenging factors which may influence the performance of WSVOS: foreground object discovery in each frame, coarse object semantic consistency within each video, and fine-grained segmentation smoothness within neighbor frames. In this paper, we establish a semantic ranking and optical warping network to simultaneously solve these three challenges in a unified framework. For the first challenge, we apply the still image saliency detection method and discover the foreground object for each frame via a segmentation network. Due to the huge discrepancies between the image saliency and the video object segmentation, we step further and propose two subnetworks to solve the other two challenges. For the second one, we propose an attentive semantic ranking subnetwork to mine video-level tags, which can learn discriminative features for semantic ranking and lead to semantic consistent segmentation masks. For the third one, we propose an optical flow warping subnetwork to constrain fine-grained segmentation smoothness within neighbor frames, which can suppress the large deformation and thus obtain smooth object boundaries for adjacent frames. Experiments on two benchmark data sets, i.e., DAVIS data set and YouTube-Objects data set, demonstrate the effectiveness of the proposed approach for segmenting out video objects under weak supervision.

AB - Weakly supervised video object segmentation (WSVOS) focuses on generating pixel-level object masks for videos only tagged with class labels, which is an essential yet challenging task. For WSVOS, the algorithm is just aware of rough category information rather than the concrete object size and location cues, besides it lacks reliable annotated exemplars to learn temporal evolution in the investigated videos. Basically, there are three challenging factors which may influence the performance of WSVOS: foreground object discovery in each frame, coarse object semantic consistency within each video, and fine-grained segmentation smoothness within neighbor frames. In this paper, we establish a semantic ranking and optical warping network to simultaneously solve these three challenges in a unified framework. For the first challenge, we apply the still image saliency detection method and discover the foreground object for each frame via a segmentation network. Due to the huge discrepancies between the image saliency and the video object segmentation, we step further and propose two subnetworks to solve the other two challenges. For the second one, we propose an attentive semantic ranking subnetwork to mine video-level tags, which can learn discriminative features for semantic ranking and lead to semantic consistent segmentation masks. For the third one, we propose an optical flow warping subnetwork to constrain fine-grained segmentation smoothness within neighbor frames, which can suppress the large deformation and thus obtain smooth object boundaries for adjacent frames. Experiments on two benchmark data sets, i.e., DAVIS data set and YouTube-Objects data set, demonstrate the effectiveness of the proposed approach for segmenting out video objects under weak supervision.

KW - optical warping

KW - semantic ranking

KW - Video object segmentation

KW - weak supervision

UR - http://www.scopus.com/inward/record.url?scp=85047021095&partnerID=8YFLogxK

U2 - 10.1109/TIP.2018.2834221

DO - 10.1109/TIP.2018.2834221

M3 - Article

AN - SCOPUS:85047021095

SN - 1057-7149

VL - 27

SP - 4025

EP - 4037

JO - IEEE Transactions on Image Processing

JF - IEEE Transactions on Image Processing

IS - 8

ER -
