In the human visual system, a given image is processed and its important information is distilled in order to recognize objects. Salient regions are informative and draw more human attention than other parts of the image, such as the background. Similarly, in computer vision, attention maps can be used to identify and exploit the effective spatial support of visual information when making image classification decisions. Attention can also help improve the separability of different classes. Other applications of attention include weakly supervised semantic segmentation, adversarial robustness, weakly supervised object localization, handling domain shift, etc.

Studies on attention can be categorized into two types: 1) post-hoc network analysis and 2) trainable attention generation. The former (such as CAM [1]) analyzes a CNN after it has been trained on image-level labels, as a network reasoning process. In contrast, trainable attention mechanisms (e.g., [2], [3]) use attention-related learning targets to generate separable and discriminative attention maps. All of these related works are built on CNNs trained in an end-to-end manner, which incurs high time and computational complexity.
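To make the post-hoc idea concrete, below is a minimal NumPy sketch of a CAM-style map in the spirit of [1]: the final convolutional feature maps are weighted by the classifier weights of the target class. The function name and array shapes are our own assumptions for illustration, not code from the cited paper.

```python
import numpy as np

def class_activation_map(features, fc_weights, class_idx):
    """Post-hoc CAM sketch in the spirit of [1].

    features:   (H, W, K) feature maps from the last conv layer
    fc_weights: (K, num_classes) weights of the linear layer that
                follows global average pooling
    class_idx:  index of the class to explain
    """
    # Weighted sum over the K channels: CAM_c(x, y) = sum_k w_k^c f_k(x, y)
    cam = features @ fc_weights[:, class_idx]   # (H, W)
    cam = np.maximum(cam, 0)                    # keep positive class evidence
    return cam / (cam.max() + 1e-8)             # normalize to [0, 1]
```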

In our research, we try to extract attention maps in a feedforward way, based on features from the channel-wise Saab transform proposed by Chen et al. in PixelHop++ [4]. Features from shallow to deep Hops are considered jointly as the representation of each pixel, since they cover different receptive fields. By weighting important regions more heavily according to the generated attention maps, we expect our model to achieve better recognition performance, because regions carrying irrelevant information, which are confusing or shared among different classes, will be suppressed. This also makes the classification system more transparent and interpretable. Currently, our experiments are conducted on CIFAR-10. In the future, more studies are planned on higher-resolution images, such as ImageNet.
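As a rough illustration of how such a per-pixel representation and attention weighting might be combined, here is a minimal NumPy sketch. The hop features themselves come from the channel-wise Saab transform of PixelHop++ [4], which is not re-implemented here; the function names, the nearest-neighbor alignment, and the assumption that hop resolutions divide evenly are ours.

```python
import numpy as np

def per_pixel_representation(hop_features, target_hw):
    """Concatenate Saab features from all hops into one vector per pixel.

    hop_features: list of (H_i, W_i, C_i) arrays, shallow to deep.
    Deeper hops have coarser grids, so each is upsampled by
    nearest-neighbor repetition to a common resolution (assumes the
    target resolution is a multiple of each hop's resolution).
    """
    H, W = target_hw
    aligned = []
    for f in hop_features:
        h, w, _ = f.shape
        f_up = f.repeat(H // h, axis=0).repeat(W // w, axis=1)
        aligned.append(f_up)
    return np.concatenate(aligned, axis=-1)     # (H, W, sum_i C_i)

def apply_attention(features, attention):
    """Reweight per-pixel features by an attention map in [0, 1]."""
    return features * attention[..., None]
```

Nearest-neighbor repetition is only a placeholder for aligning hops of different spatial resolutions; any interpolation scheme could be substituted.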


References:

  • [1] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning Deep Features for Discriminative Localization,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, 2016.
  • [2] S. Jetley, N. A. Lord, N. Lee, and P. H. S. Torr, “Learn to Pay Attention,” International Conference on Learning Representations (ICLR), 2018.
  • [3] L. Wang et al., “Sharpen Focus: Learning with Attention Separability and Consistency,” 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea (South), 2019.
  • [4] Y. Chen, M. Rouhsedaghat, S. You, R. Rao, and C.-C. J. Kuo, “PixelHop++: A Small Successive-Subspace-Learning-Based (SSL-Based) Model for Image Classification,” 2020 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 2020.

— by Yijing Yang