Facial expression recognition (FER) aims to infer human emotion from facial images. The technique can be applied to driver status monitoring, affective computing, and serious games. Solutions to FER fall into two categories: conventional methods and deep-learning-based (DL-based) methods. While conventional methods rely on hand-crafted features, DL-based methods perform end-to-end optimization of a network whose performance depends heavily on the training data, the network architecture, and the cost function. DL-based methods have become popular in recent years because of their higher accuracy, yet they demand large model sizes. Although there has been research on reducing the parameter counts of DL models, it does not fully solve the computational complexity problem.

In this research, we are interested in a lightweight FER solution, which we name ExpressionHop. ExpressionHop has low computational and memory complexity, making it well suited to mobile and edge computing environments. As shown in Figure 1, ExpressionHop consists of four modules: 1) cropping patches around detected facial landmarks, 2) applying filter banks to each patch to generate a rich set of joint spatial-spectral features, 3) conducting the discriminant feature test (DFT) to select features of higher discriminant power, and 4) performing the final classification with a classifier. We benchmark ExpressionHop against traditional and deep learning methods on several commonly used FER datasets, including JAFFE, CK+, and KDEF. Experimental results in Table 1 show that ExpressionHop achieves comparable or better classification accuracy. Yet, its model contains only about 30K parameters, significantly fewer than those of deep learning methods.
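To make the four-module pipeline concrete, below is a minimal sketch in Python with NumPy and scikit-learn. It is not the authors' implementation: the learned filter bank is approximated here by PCA (standing in for the Saab transform), the discriminant feature test by a simple binned weighted-entropy score, and all names, patch sizes, and hyperparameters (`PATCH`, `crop_patches`, `fit_filter_bank`, `dft_scores`, `keep`) are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

PATCH = 16  # hypothetical square patch size centered on each landmark

def crop_patches(image, landmarks):
    # Module 1: crop one PATCH x PATCH window around each (x, y) landmark.
    half = PATCH // 2
    return np.stack([image[y - half:y + half, x - half:x + half]
                     for (x, y) in landmarks])

def fit_filter_bank(patches, n_filters=20):
    # Module 2: learn a data-driven filter bank from training patches
    # (PCA is used here as a stand-in for the Saab transform).
    return PCA(n_components=n_filters).fit(patches.reshape(len(patches), -1))

def dft_scores(feats, labels, n_bins=16):
    # Module 3: approximate the discriminant feature test by binning each
    # 1-D feature and computing the weighted entropy of the class labels
    # per bin; lower entropy means higher discriminant power.
    scores = np.empty(feats.shape[1])
    for j in range(feats.shape[1]):
        edges = np.quantile(feats[:, j], np.linspace(0, 1, n_bins + 1)[1:-1])
        bins = np.digitize(feats[:, j], edges)
        ent = 0.0
        for b in np.unique(bins):
            mask = bins == b
            p = np.bincount(labels[mask]) / mask.sum()
            p = p[p > 0]
            ent += mask.mean() * -(p * np.log(p)).sum()
        scores[j] = ent
    return scores

def train(images, all_landmarks, labels, keep=1000):
    # Modules 1-4 chained on the training set: patches -> filter
    # responses -> DFT feature selection -> classifier.
    patch_sets = np.stack([crop_patches(im, lm)
                           for im, lm in zip(images, all_landmarks)])
    n, L = patch_sets.shape[:2]
    bank = fit_filter_bank(patch_sets.reshape(n * L, PATCH, PATCH))
    feats = bank.transform(patch_sets.reshape(n * L, -1)).reshape(n, -1)
    sel = np.argsort(dft_scores(feats, labels))[:keep]
    clf = SVC().fit(feats[:, sel], labels)
    return bank, sel, clf
```

At test time the same crop, filter, and select steps would be applied before calling `clf.predict`; the sketch only illustrates the module structure, not the exact transforms or classifier of the paper.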

As for future research directions, there are several extensions to pursue. First, we would like to extend ExpressionHop to non-frontal images. Second, it is important to consider more challenging settings such as larger illumination variations and occlusion. Third, it is interesting to evaluate its performance in a cross-dataset setting. Last, we plan to generalize the solution from images to videos and further boost the classification performance.

— by Chengwei Wei