How to extract useful information from an image has long been an essential question for feature extraction in the field of Computer Vision. Convolutional Neural Networks (CNNs) have no explicit built-in mechanism for identifying salient regions; however, through training they implicitly learn to extract the features relevant to their predictions. For example, analyses like Grad-CAM [1] can show which parts of the input contribute most to the prediction, as shown in Figure 1. Another paradigm besides CNNs is Transformer-based models such as ViT [2] and DINO [3], which have attention as an intrinsic mechanism. Attention enables a model to focus on the most relevant parts of an input when making predictions, as shown in Figure 2. Initially popularized in Natural Language Processing, attention soon came to dominate Computer Vision because of its effectiveness at focusing on the critical information in the input.
Inspired by this research, we explore the possibility of Forward Green-Attention. Specifically, we aim to find the important or salient regions in an input image in a feed-forward fashion, without backpropagation. One method we try is to leverage the SHAP values of an XGBoost model [4]. SHAP values help interpret the XGBoost model by showing how each individual feature drives the prediction from the base value (the expected value across all predictions) toward the final output for a specific data instance. In a binary classification setting, a positive SHAP value indicates that the corresponding feature pushes the prediction toward one, while a negative SHAP value indicates that the feature pushes the prediction toward zero. By locating the spatial positions of the features with large positive SHAP values, we can find the attention region that leads the model to a positive prediction; conversely, the locations of the features with large negative SHAP values give the attention region that drives the model toward a negative prediction, as sketched below. With this analysis, we gain insight into which parts of the image are most important and relevant to the classification problem. A limitation of this method is that the size and shape of the attention region are restricted to the size and shape of the feature's receptive field, i.e., we cannot obtain a per-pixel attention score. Moreover, the method relies heavily on the quality of the features. Further and deeper exploration of this topic is needed.
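As a concrete illustration, the following is a minimal sketch of how such a patch-level SHAP attention map could be computed with the shap and xgboost libraries. The grid size, the features-per-patch layout, and the synthetic training data are assumptions made for illustration, not the actual feature pipeline used in our experiments.

```python
# Minimal sketch: mapping XGBoost SHAP values back to image-patch locations.
# Assumes each image is split into a GRID x GRID set of patches and each
# patch contributes FEATS_PER_PATCH features, flattened in row-major order.
# All sizes and the random data below are illustrative assumptions.
import numpy as np
import shap
import xgboost as xgb

GRID = 8               # 8 x 8 grid of patches (assumption)
FEATS_PER_PATCH = 16   # features extracted per patch (assumption)

# X_train / y_train: flattened patch features and binary labels (synthetic here)
X_train = np.random.rand(100, GRID * GRID * FEATS_PER_PATCH)
y_train = np.random.randint(0, 2, size=100)

model = xgb.XGBClassifier(n_estimators=100, max_depth=4)
model.fit(X_train, y_train)

# TreeExplainer attributes the model output to each feature relative to the
# base value (expected prediction). The exact output shape can vary slightly
# across shap versions; the Explanation API below returns (n_samples, n_features).
explainer = shap.TreeExplainer(model)
x = X_train[:1]                     # one instance to explain
shap_values = explainer(x).values   # shape: (1, n_features)

# Aggregate per-feature attributions into per-patch scores to obtain a coarse
# attention map, limited in resolution to the patch grid.
per_patch = shap_values[0].reshape(GRID, GRID, FEATS_PER_PATCH)
attention_pos = np.clip(per_patch, 0, None).sum(axis=-1)  # pushes prediction toward 1
attention_neg = np.clip(per_patch, None, 0).sum(axis=-1)  # pushes prediction toward 0
print(attention_pos)  # GRID x GRID positive-attention map
```

Summing the positive and negative SHAP values separately per patch keeps the two attention maps interpretable: one highlights regions supporting a positive prediction, the other regions supporting a negative one.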
[1] Selvaraju, Ramprasaath R., et al. “Grad-CAM: visual explanations from deep networks via gradient-based localization.” International journal of computer vision 128 (2020): 336-359.
[2] Dosovitskiy, Alexey, et al. “An image is worth 16×16 words: Transformers for image recognition at scale.” arXiv preprint arXiv:2010.11929 (2020).
[3] Caron, Mathilde, et al. “Emerging properties in self-supervised vision transformers.” Proceedings of the IEEE/CVF international conference on computer vision. 2021.
[4] Lundberg, Scott M., Gabriel G. Erion, and Su-In Lee. “Consistent individualized feature attribution for tree ensembles.” arXiv preprint arXiv:1802.03888 (2018).