Object tracking is a fundamental computer vision problem with a wide range of applications, such as video surveillance, smart traffic systems, and autonomous driving. Most state-of-the-art object trackers now adopt deep neural networks, achieving high tracking performance at the expense of huge computational resources and heavy memory use. Here, we seek a more lightweight solution that requires fewer resources for training and inference and has a much smaller model size, making real-time tracking possible on small devices such as mobile phones and autonomous drones.

The proposed object tracker is built upon the PixelHop framework and is therefore called OTHop (Object Tracking PixelHop). The term “hop” denotes the neighborhood of a pixel. OTHop conducts spectral analysis using the Saab transform on neighborhoods of various sizes centered on a pixel, through a sequence of cascaded dimension reduction units. This naturally forms a multi-resolution feature extraction scheme, which helps capture unusual patterns that deserve more attention during tracking. We then adopt an XGBoost classifier as the binary predictor to differentiate foreground pixels from background pixels. The classifier is pre-trained on an offline dataset and then updated online using either the initial frame or preceding frames, with Saab coefficients as the input. Based on the classification results, we derive the object bounding box.
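
To make the feature extraction step more concrete, here is a minimal NumPy sketch of a single Saab-style stage. It is a simplified illustration, not the OTHop implementation: the function names (`extract_patches`, `fit_saab`, `apply_saab`), the stride-1 patch sampling, and the choice of bias (the largest patch norm, so that all responses are non-negative) are assumptions made for this example. The idea it shows is that each neighborhood is projected onto one constant DC kernel plus a few AC kernels obtained by PCA on the DC-removed patches.

```python
import numpy as np

def extract_patches(image, size):
    """Collect every size x size neighborhood (stride 1) as a flat row vector."""
    h, w = image.shape
    rows = [
        image[i:i + size, j:j + size].ravel()
        for i in range(h - size + 1)
        for j in range(w - size + 1)
    ]
    return np.asarray(rows, dtype=np.float64)

def fit_saab(patches, num_ac):
    """Fit one Saab-style stage: a constant DC kernel plus PCA-derived AC kernels."""
    n = patches.shape[1]
    dc_kernel = np.ones(n) / np.sqrt(n)
    # Remove the DC component of every patch before running PCA.
    dc = patches @ dc_kernel
    residual = patches - np.outer(dc, dc_kernel)
    # PCA via SVD on the mean-centered residuals; rows of vt are principal axes.
    centered = residual - residual.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    kernels = np.vstack([dc_kernel, vt[:num_ac]])
    # Assumed bias choice: the largest patch norm keeps all responses >= 0.
    bias = np.linalg.norm(patches, axis=1).max()
    return kernels, bias

def apply_saab(patches, kernels, bias):
    """Project patches onto the kernels and add the shared bias."""
    return patches @ kernels.T + bias

rng = np.random.default_rng(0)
image = rng.random((16, 16))           # stand-in for a grayscale frame
patches = extract_patches(image, 3)    # 3x3 neighborhoods
kernels, bias = fit_saab(patches, num_ac=4)
features = apply_saab(patches, kernels, bias)
print(features.shape)
```

Cascading several such stages, each operating on the output of the previous one over a larger effective neighborhood, would give the multi-resolution features described above.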

To sum up, OTHop has the following main steps:

  1. Extract joint spatial-spectral features based on the PixelHop framework;
  2. Predict, with a trained XGBoost binary classifier, the probability that a spatial region (which can be of various sizes) belongs to the foreground object rather than the background;
  3. Fuse the results obtained at different hops in Step 2 to obtain the final object bounding boxes.
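
The fusion in the last step can be sketched as follows. This post does not specify the fusion rule, so the example assumes one simple possibility: the per-hop foreground probability maps (already resized to a common resolution) are averaged with optional weights, thresholded, and enclosed in the tightest box. The function names, the weights, and the 0.5 threshold are hypothetical choices for illustration only.

```python
import numpy as np

def fuse_probability_maps(prob_maps, weights=None):
    """Weighted average of per-hop foreground probability maps of equal shape."""
    stack = np.stack(prob_maps).astype(np.float64)
    if weights is None:
        weights = np.ones(len(prob_maps))
    weights = np.asarray(weights, dtype=np.float64)
    weights = weights / weights.sum()
    # Contract the hop axis: result has the shape of a single map.
    return np.tensordot(weights, stack, axes=1)

def bounding_box(prob_map, threshold=0.5):
    """Tightest (x, y, width, height) box around pixels above threshold, or None."""
    ys, xs = np.nonzero(prob_map > threshold)
    if ys.size == 0:
        return None
    x0, x1 = xs.min(), xs.max()
    y0, y1 = ys.min(), ys.max()
    return int(x0), int(y0), int(x1 - x0 + 1), int(y1 - y0 + 1)

# Two toy probability maps standing in for classifier outputs at two hops.
hop_a = np.zeros((10, 12))
hop_a[2:6, 3:8] = 0.9
hop_b = np.zeros((10, 12))
hop_b[3:7, 4:9] = 0.8

fused = fuse_probability_maps([hop_a, hop_b])
print(bounding_box(fused))
```

With equal weights, only pixels that both hops consider foreground exceed the threshold in this toy case, so the box covers the overlap of the two regions.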

The tracker is tested on the popular general object tracking benchmark TB-50, where the coverage ratio (success plot) and precision (precision plot) are used to evaluate its overall performance on 50 long video sequences. We also report other performance metrics, such as training time, inference time, and model size, to demonstrate its advantages with respect to the lightweight requirement.
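
For readers unfamiliar with these two metrics, the sketch below shows how they are commonly computed for a single operating point. It assumes boxes in `(x, y, width, height)` form, takes the coverage ratio to be intersection over union, and uses an overlap threshold of 0.5 and a center-error threshold of 20 pixels as illustrative defaults. A full success or precision plot sweeps these thresholds over a range. The function names are hypothetical and not taken from any benchmark toolkit.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def center_error(box_a, box_b):
    """Euclidean distance in pixels between the two box centers."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    return float(np.hypot((ax + aw / 2) - (bx + bw / 2),
                          (ay + ah / 2) - (by + bh / 2)))

def success_rate(predicted, ground_truth, overlap_threshold=0.5):
    """Fraction of frames whose overlap exceeds the threshold."""
    overlaps = np.array([iou(p, g) for p, g in zip(predicted, ground_truth)])
    return float((overlaps > overlap_threshold).mean())

def precision(predicted, ground_truth, pixel_threshold=20.0):
    """Fraction of frames whose center error is within the threshold."""
    errors = np.array([center_error(p, g) for p, g in zip(predicted, ground_truth)])
    return float((errors <= pixel_threshold).mean())

# Toy two-frame sequence: one perfect prediction, one complete miss.
predicted = [(0, 0, 10, 10), (0, 0, 10, 10)]
ground_truth = [(0, 0, 10, 10), (50, 50, 10, 10)]
print(success_rate(predicted, ground_truth), precision(predicted, ground_truth))
```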



Picture 1 credit: [1]

[1] Fiaz, Mustansar, et al. “Handcrafted and deep trackers: Recent visual object tracking approaches and trends.” ACM Computing Surveys (CSUR) 52.2 (2019): 1-44.


— by Zhiruo Zhou