In recent years, image synthesis techniques based on convolutional neural networks (CNNs), such as variational auto-encoders (VAEs) and generative adversarial networks (GANs), have developed rapidly. They can generate images realistic enough that people find it hard to tell fake from real. Most state-of-the-art detectors of CNN-generated images are built on deep neural networks. However, their performance is easily confined to specific fake image datasets, and they fail to generalize well to other datasets.
We propose a new CNN-generated-image detector, named Attentive PixelHop (or A-PixelHop). A-PixelHop is designed under the assumption that it is difficult to synthesize high-quality high-frequency components in local regions. Specifically, we first select edge/texture blocks that contain significant high-frequency components, then apply multiple filter banks to them to obtain rich sets of spatial-spectral responses as features. Features from different filter banks may differ in how well they separate fake from real images. Therefore, we feed the features to multiple binary classifiers to obtain a set of soft decisions and keep only those with the highest discriminant ability. Finally, we develop an effective ensemble scheme to fuse the soft decisions from the more discriminant channels into the final decision. The system design is shown in Figure 1. Compared with CNN-based fake image detection methods, our method offers low computational complexity and a small model size, high detection performance against a wide range of generative models, and mathematical transparency. Experimental results show that A-PixelHop outperforms all state-of-the-art benchmarking methods on CycleGAN-generated images (see Table 1). Furthermore, it generalizes well to unseen generative models and datasets (see Table 2).
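To make the data flow of the pipeline concrete, the following is a minimal toy sketch of its four stages: block selection, filter-bank features, per-channel soft decisions with channel selection, and ensemble fusion. It is not the A-PixelHop implementation. The abstract does not specify the filter banks, classifiers, selection criterion, or fusion rule, so everything here is an assumption made for illustration: the fixed 2x2 Haar-like filters in `FILTERS`, the neighbour-difference energy in `hf_energy`, the one-dimensional sigmoid classifier in `fit_channel`, ranking channels by training accuracy, fusing by a plain average, and the synthetic data from `make_image`, where "fake" images are simply blurred noise.

```python
import math
import random

def blocks(img, size):
    """Split a 2-D grayscale image (list of rows) into non-overlapping size x size blocks."""
    h, w = len(img), len(img[0])
    return [[row[c:c + size] for row in img[r:r + size]]
            for r in range(0, h - size + 1, size)
            for c in range(0, w - size + 1, size)]

def hf_energy(block):
    """High-frequency proxy (assumption): mean squared difference of neighbouring pixels."""
    n = len(block)
    dx = [(block[i][j + 1] - block[i][j]) ** 2 for i in range(n) for j in range(n - 1)]
    dy = [(block[i + 1][j] - block[i][j]) ** 2 for i in range(n - 1) for j in range(n)]
    return (sum(dx) + sum(dy)) / (len(dx) + len(dy))

def select_blocks(img, size=4, keep=8):
    """Stage 1: keep the `keep` blocks with the largest high-frequency energy."""
    return sorted(blocks(img, size), key=hf_energy, reverse=True)[:keep]

# Hypothetical fixed filter bank; each filter defines one feature channel.
FILTERS = {
    "horizontal": [[1, -1], [1, -1]],
    "vertical": [[1, 1], [-1, -1]],
    "diagonal": [[1, -1], [-1, 1]],
}

def channel_features(img):
    """Stage 2: per channel, the mean absolute filter response over the selected blocks."""
    selected = select_blocks(img)
    feats = {}
    for name, k in FILTERS.items():
        responses = [abs(sum(k[u][v] * b[i + u][j + v] for u in range(2) for v in range(2)))
                     for b in selected
                     for i in range(len(b) - 1) for j in range(len(b) - 1)]
        feats[name] = sum(responses) / len(responses)
    return feats

def fit_channel(xs, ys):
    """Fit a 1-D soft classifier: a sigmoid centred midway between the two class means."""
    m1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)  # mean of fake class
    m0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)  # mean of real class
    scale = (m1 - m0) / 2 or 1.0  # signed; falls back to 1.0 if the means coincide
    return (m0 + m1) / 2, scale

def soft(model, x):
    """Soft decision in (0, 1); values above 0.5 lean towards 'fake'."""
    mid, scale = model
    z = max(-30.0, min(30.0, (x - mid) / scale))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def train(images, labels, top_k=2):
    """Stage 3: one classifier per channel, then keep the top_k most accurate channels."""
    feats = [channel_features(im) for im in images]
    models, acc = {}, {}
    for name in FILTERS:
        xs = [f[name] for f in feats]
        models[name] = fit_channel(xs, labels)
        hits = sum((soft(models[name], x) > 0.5) == bool(y) for x, y in zip(xs, labels))
        acc[name] = hits / len(labels)  # training accuracy as the ranking criterion
    chosen = sorted(FILTERS, key=acc.get, reverse=True)[:top_k]
    return {name: models[name] for name in chosen}

def predict(model, img):
    """Stage 4: fuse the selected channels' soft decisions by averaging them."""
    feats = channel_features(img)
    return sum(soft(m, feats[name]) for name, m in model.items()) / len(model)

def make_image(rng, fake, size=16):
    """Toy data (assumption): i.i.d. noise; 'fake' images are 3x3 box-blurred noise."""
    raw = [[rng.random() for _ in range(size + 2)] for _ in range(size + 2)]
    if not fake:
        return [row[1:-1] for row in raw[1:-1]]
    return [[sum(raw[i + u][j + v] for u in range(3) for v in range(3)) / 9
             for j in range(size)] for i in range(size)]

if __name__ == "__main__":
    rng = random.Random(0)
    labels = [0, 1] * 20  # 0 = real, 1 = fake
    model = train([make_image(rng, y) for y in labels], labels)
    print("selected channels:", sorted(model))
    print("fake-ness of a blurred image:", round(predict(model, make_image(rng, True)), 2))
```

In this sketch, a score above 0.5 from `predict` leans towards "fake". Ranking channels on training accuracy keeps the example short; a held-out validation split would be the more careful choice.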