One major technical bottleneck in video object segmentation is the need to label object masks for training instances. To address this, we propose preparing inexpensive yet high-quality pseudo ground truth, corrected with motion cues, for training video object segmentation. Our method performs semantic segmentation using instance segmentation networks and then selects the segmented object of interest as the pseudo ground truth based on motion information. The pseudo ground truth is then used to fine-tune the pretrained objectness network, which facilitates object segmentation in the remaining frames of the video. We show that the pseudo ground truth effectively improves segmentation performance. This straightforward unsupervised video object segmentation method is more efficient than existing methods. Experimental results on the DAVIS and FBMS benchmarks show that the proposed method outperforms state-of-the-art unsupervised segmentation methods. The category-agnostic pseudo ground truth also has great potential to extend to tracking multiple arbitrary objects.
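The selection step can be illustrated with a minimal sketch. It assumes that each candidate from the instance segmentation network is a binary mask and that the motion cue is also available as a binary mask (for example, thresholded optical flow). The overlap criterion and the names `mask_iou` and `select_pseudo_ground_truth` are assumptions made for illustration, not the paper's actual selection rule. Plain nested lists are used to keep the example dependency-free.

```python
def mask_iou(a, b):
    """IoU of two binary masks given as equally sized nested lists of 0/1."""
    inter = 0
    union = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            if pa and pb:
                inter += 1
            if pa or pb:
                union += 1
    # Two empty masks carry no evidence of overlap, so score them as 0 here.
    return inter / union if union else 0.0


def select_pseudo_ground_truth(instance_masks, motion_mask):
    """Return the candidate mask that best overlaps the motion mask.

    Hypothetical criterion: highest IoU with the motion mask.
    Returns None when no candidate overlaps the moving region at all.
    """
    best_mask = None
    best_score = 0.0
    for mask in instance_masks:
        score = mask_iou(mask, motion_mask)
        if score > best_score:
            best_mask, best_score = mask, score
    return best_mask


# Two candidate instances; only the right-hand one is moving.
left = [[1, 1, 0, 0],
        [1, 1, 0, 0]]
right = [[0, 0, 1, 1],
         [0, 0, 1, 1]]
motion = [[0, 0, 0, 1],
          [0, 0, 1, 1]]

chosen = select_pseudo_ground_truth([left, right], motion)
```

In this toy frame `chosen` is the right-hand mask, which would then serve as the pseudo ground truth for fine-tuning.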

Our goal is to segment the primary video object without manual annotations. Rather than using the temporal information of the whole video clip at once, the proposed method processes one frame at a time, so errors in one frame do not propagate to subsequent frames. As a result, the proposed method is more tolerant of occlusion and fast motion. We evaluate the proposed method extensively on the DAVIS and FBMS datasets. It achieves state-of-the-art performance on both, with a mean intersection-over-union (IoU) of 79.3% on DAVIS and 77.9% on FBMS.
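For readers unfamiliar with the metric, the following sketch shows how a mean IoU over frames can be computed. It is a simplified illustration, not the official DAVIS or FBMS evaluation code. The function names are hypothetical, and scoring two empty masks as 1.0 is an assumed convention.

```python
def frame_iou(pred, gt):
    """IoU between a predicted and a ground-truth binary mask (nested lists of 0/1)."""
    inter = 0
    union = 0
    for row_p, row_g in zip(pred, gt):
        for p, g in zip(row_p, row_g):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def mean_iou(pred_masks, gt_masks):
    """Average the per-frame IoU over a sequence of frames."""
    scores = [frame_iou(p, g) for p, g in zip(pred_masks, gt_masks)]
    return sum(scores) / len(scores)


# Frame 1: prediction covers 2 pixels, ground truth 1 of them (IoU = 1/2).
# Frame 2: prediction matches the ground truth exactly (IoU = 1).
preds = [[[1, 1], [0, 0]], [[0, 1], [0, 1]]]
gts = [[[1, 0], [0, 0]], [[0, 1], [0, 1]]]

score = mean_iou(preds, gts)  # (0.5 + 1.0) / 2 = 0.75
```

Reported benchmark figures such as those above are typically averaged per sequence and then across sequences, so the aggregation in a real evaluation may differ from this flat per-frame mean.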

This paper will appear at the Asian Conference on Computer Vision (ACCV) 2018.


–By Ye Wang