Video object tracking is one of the fundamental computer vision problems and has found rich applications in video surveillance, autonomous navigation, robotics vision, etc. In the setting of online single object tracking (SOT), a tracker is given a bounding box on the target object at the first frame and then predicts its boxes for all remaining frames. Online tracking methods can be categorized into two categories, unsupervised and supervised. Traditional trackers are unsupervised. Recent deep-learning-based (DL-based) trackers demand supervision. Unsupervised trackers are attractive since they do not need annotated boxes to train supervised trackers. The performance of trackers can be measured in terms of accuracy (higher success rate), robustness (automatic recovery from tracking loss), and speed (higher FPS).

We examine the design of an unsupervised high-performance tracker and name it UHP-SOT (Unsupervised High-Performance Single Object Tracker) in this work. UHP-SOT consists of three modules: 1) appearance model update, 2) background motion modeling, and 3) trajectory-based box prediction. Previous unsupervised trackers pay attention to efficient and effective appearance model update. Built upon this foundation, an unsupervised discriminative-correlation-filters-based (DCF-based) tracker STRCF [1] is adopted by UHP-SOT as the baseline in the first module. Yet, the use of the first module alone has shortcomings such as failure in tracking loss recovery and being weak in box size adaptation. We propose ideas for background motion modeling and trajectory-based box prediction to address the mentioned problems. The baseline tracker gets initialized at the first frame. For the following frames, UHP-SOT gets proposals from all three modules and chooses one of them as the final prediction based on a fusion strategy, as shown in Fig. 1. Fig. 2 shows example results on sequences from the OTB-2015 [2] benchmark. Our tracker runs at a near real-time speed of 22.7 FPS on a CPU based on our preliminary study.

 

Reference:

[1] Li, Feng, et al. “Learning spatial-temporal regularized correlation filters for visual tracking.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.

[2] Y. Wu, J. Lim, and M.-H. Yang, “Object tracking benchmark,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1834–1848, 2015.