Feature extraction in Computer Vision aims to identify the image regions and patterns most relevant to a downstream task. While Convolutional Neural Networks (CNNs) implicitly learn feature importance, tools like Grad-CAM can help interpret which image regions influence a model's predictions. More recently, Transformer-based models like ViT and DINO have gained traction by incorporating attention mechanisms that naturally focus on critical input parts, improving interpretability.
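
To make the Grad-CAM idea above concrete, the sketch below weights a convolutional layer's feature maps by their averaged gradients to localize class-relevant regions. It is a minimal sketch, not a full implementation: it assumes PyTorch and torchvision (0.13+) are available, picks ResNet-18 with its `layer4` block as the target layer, and feeds a random tensor in place of a real preprocessed image.

```python
# Minimal Grad-CAM sketch (assumptions: PyTorch + torchvision >= 0.13;
# "layer4" is the last convolutional block of ResNet-18).
import torch
import torch.nn.functional as F
from torchvision.models import resnet18, ResNet18_Weights

model = resnet18(weights=ResNet18_Weights.DEFAULT).eval()

activations, gradients = {}, {}

def fwd_hook(module, inp, out):
    activations["feat"] = out.detach()

def bwd_hook(module, grad_in, grad_out):
    gradients["feat"] = grad_out[0].detach()

model.layer4.register_forward_hook(fwd_hook)
model.layer4.register_full_backward_hook(bwd_hook)

x = torch.randn(1, 3, 224, 224)        # stand-in for a preprocessed image
logits = model(x)
logits[0, logits.argmax()].backward()  # gradient of the top class score

# Weight each feature map by its spatially averaged gradient, then combine.
weights = gradients["feat"].mean(dim=(2, 3), keepdim=True)
cam = F.relu((weights * activations["feat"]).sum(dim=1))

# Upsample the 7x7 map to input resolution and normalize to [0, 1].
cam = F.interpolate(cam.unsqueeze(1), size=x.shape[2:], mode="bilinear")
heatmap = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
```

Note that the backward pass here is exactly what the method introduced below sets out to avoid.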

Building on these ideas, Jie-En Yao from the MCL lab proposes a novel approach, Forward Green-Attention, which identifies essential regions in an image without requiring backpropagation. The method uses SHAP values computed from an XGBoost model to highlight the regions that push the prediction toward or away from a class: regions with high positive SHAP values drive positive classifications, while regions with negative values drive negative ones. Though promising, the approach is limited by the receptive field size and by interpretability that depends on the chosen features, leaving room for further refinement.
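
To illustrate the SHAP-based attribution step, here is a minimal sketch of the general idea rather than the author's actual pipeline: synthetic per-region features stand in for whatever descriptors Forward Green-Attention pools from the image, and XGBoost's built-in `pred_contribs=True` option returns one SHAP value per feature (region), whose sign indicates the direction of its contribution.

```python
# Minimal sketch of SHAP-based region attribution with XGBoost.
# Hypothetical setup: 16 synthetic "region" features per image stand in
# for pooled image-patch descriptors; this is NOT the paper's pipeline.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
n_images, n_regions = 500, 16

# Synthetic data: regions 0-3 push the label positive, regions 12-15 negative.
X = rng.normal(size=(n_images, n_regions))
y = (X[:, :4].sum(axis=1) - X[:, 12:].sum(axis=1) > 0).astype(int)

model = xgb.XGBClassifier(n_estimators=100, max_depth=3)
model.fit(X, y)

# Per-feature SHAP contributions for one image; the last column returned
# by pred_contribs=True is the base value, so we drop it.
dmat = xgb.DMatrix(X[:1])
contribs = model.get_booster().predict(dmat, pred_contribs=True)
shap_values = contribs[0, :-1]

# Positive SHAP -> region pushes toward the positive class; negative -> away.
for r in np.argsort(-np.abs(shap_values))[:5]:
    print(f"region {r:2d}: SHAP = {shap_values[r]:+.3f}")
```

Because TreeSHAP values are computed directly from the tree structure, no gradients or backward passes are involved, which is consistent with the forward-only framing of the approach.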