Object detection is one of the most essential and challenging tasks in computer vision. While most state-of-the-art object detection methods adopt an end-to-end deep neural network, we aim at an interpretable framework with low complexity, high training efficiency, and high performance. Our method is built upon the PixelHop framework, as shown in Fig. 1. The term "hop" denotes the neighborhood of a pixel. PixelHop conducts spectral analysis on neighborhoods of different sizes centered on a pixel through a sequence of cascaded dimension-reduction units. The neighborhoods of an object contain representative patterns of the object, such as salient contours, and as a result they have distinctive spectral signatures at the scale that matches the object size. Bounding boxes and class labels can therefore be predicted by supervised learning with the Saab coefficients at the proper hops as the representations.
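As a concrete illustration, the following is a minimal single-stage sketch of the Saab transform underlying PixelHop: neighborhood patches are collected at a given stride, the DC (patch-mean) component is separated out, and PCA on the mean-removed residual supplies the AC kernels. The function name `saab_stage` and its parameters are illustrative choices rather than our implementation; in the full pipeline the kernels would be learned on training data and then reused.

```python
import numpy as np
from sklearn.decomposition import PCA

def saab_stage(X, window=3, stride=3, num_kernels=8):
    """One simplified Saab stage (illustrative sketch). X: (N, H, W, C)
    inputs; returns (N, Ho, Wo, num_kernels) coefficients. The first
    coefficient is the DC (patch-mean) response; the remaining ones come
    from PCA on the mean-removed patches (the AC kernels)."""
    N, H, W, C = X.shape
    patches = []
    for i in range(0, H - window + 1, stride):      # slide the hop window
        for j in range(0, W - window + 1, stride):
            patches.append(X[:, i:i + window, j:j + window, :].reshape(N, -1))
    P = len(patches)
    flat = np.stack(patches, axis=1).reshape(N * P, -1)  # (N*P, window*window*C)

    dc = flat.mean(axis=1, keepdims=True)           # DC component per patch
    ac = PCA(n_components=num_kernels - 1).fit_transform(flat - dc)

    coeffs = np.concatenate([dc, ac], axis=1)       # (N*P, num_kernels)
    Ho = (H - window) // stride + 1
    Wo = (W - window) // stride + 1
    return coeffs.reshape(N, Ho, Wo, num_kernels)   # back onto the spatial grid
```

Cascading such stages, each with a larger effective neighborhood, yields the multi-hop spectral representations used below.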

Our method takes YOLO's problem formulation as a reference and assembles three major modules to perform the object detection task. As shown in Fig. 1, through proper settings of PixelHop, we divide all objects into three scales, i.e., large (shown in blue), medium (shown in green), and small (shown in red), and make the hops with matching receptive fields (RF) responsible for proposing anchor boxes for the corresponding scales (illustrated with the "cat" example). With the Saab coefficients at each hop, we propose anchor boxes at each spatial location. For each anchor box, we train module 1 to predict its confidence score, module 2 to predict its class label, and module 3 to predict its box regression. At inference, our model first proposes candidate boxes for each image and then applies non-maximum suppression based on the confidence scores to keep the best proposals.
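To make the inference path concrete, the sketch below combines the three modules with greedy non-maximum suppression. All interface names (`detect`, `nms`, `hop_features`, `modules`, `anchors`) are hypothetical, the three modules are stand-ins such as fitted scikit-learn estimators, and the YOLO-style box decoding (center offsets plus exponential width/height scaling) is an assumed convention; for brevity, one regression output per grid cell is shared across that cell's anchors.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.45):
    """Greedy non-maximum suppression: repeatedly keep the highest-confidence
    box and discard proposals that overlap it too much. boxes: (M, 4) as
    (x1, y1, x2, y2); returns indices of kept boxes."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i, rest = order[0], order[1:]
        keep.append(i)
        # IoU between the kept box and the remaining candidates
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter + 1e-9)
        order = rest[iou <= iou_thresh]
    return keep

def detect(hop_features, modules, anchors, conf_thresh=0.5):
    """hop_features: {scale: (H, W, K) Saab coefficient grid}; modules:
    {scale: (conf_model, cls_model, reg_model)} of fitted estimators;
    anchors: {scale: [(w, h), ...]} in normalized image units."""
    all_boxes, all_scores, all_labels = [], [], []
    for scale, feat in hop_features.items():
        H, W, K = feat.shape
        conf_m, cls_m, reg_m = modules[scale]
        flat = feat.reshape(-1, K)                 # one feature vector per location
        conf = conf_m.predict_proba(flat)[:, 1]    # module 1: confidence score
        cls = cls_m.predict(flat)                  # module 2: class label
        reg = reg_m.predict(flat)                  # module 3: (dx, dy, dw, dh) offsets
        for idx in np.flatnonzero(conf > conf_thresh):
            y, x = divmod(idx, W)
            for aw, ah in anchors[scale]:
                # decode this grid cell's anchor with the regressed offsets
                cx = (x + 0.5 + reg[idx, 0]) / W
                cy = (y + 0.5 + reg[idx, 1]) / H
                bw, bh = aw * np.exp(reg[idx, 2]), ah * np.exp(reg[idx, 3])
                all_boxes.append([cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2])
                all_scores.append(conf[idx])
                all_labels.append(cls[idx])
    boxes, scores = np.array(all_boxes), np.array(all_scores)
    keep = nms(boxes, scores) if len(all_boxes) else []
    return [(boxes[i], scores[i], all_labels[i]) for i in keep]
```

Keying the greedy suppression on the confidence score mirrors the final step described above: among overlapping proposals, only the most confident one survives.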