Traditional image coding has achieved great success over the past four decades. Image coding standards such as JPEG and JPEG-2000 have been developed and are widely used today. Furthermore, the intra coding schemes of modern video coding standards also provide very effective image coding solutions. Several powerful tools are used to de-correlate pixel values:
1. Block transform coding, which is used in the majority of codecs: images are partitioned into blocks of different sizes, and the pixel values in each block are transformed from the spatial domain to the frequency domain for energy compaction before quantization and entropy coding.
2. Intra prediction, another powerful tool that reduces pixel correlation at low cost by predicting a block from the pixel values of neighboring blocks. The residuals after intra prediction are still coded by block transform coding.
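The energy compaction mentioned in item 1 can be illustrated with a minimal sketch, assuming an 8x8 block, an orthonormal DCT-II as the block transform, and a uniform quantizer. The block size `N`, the step `STEP`, and the synthetic ramp block are illustrative assumptions, not values taken from any particular standard.

```python
import math

N = 8       # block size (illustrative; real codecs use several sizes)
STEP = 16   # uniform quantization step (hypothetical value)

def dct_matrix(n):
    """Orthonormal DCT-II matrix: row k holds the k-th cosine basis vector."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(r) for r in zip(*m)]

def dct2(block):
    """Separable 2-D DCT: C * block * C^T."""
    c = dct_matrix(len(block))
    return matmul(matmul(c, block), transpose(c))

def idct2(coeffs):
    """Inverse 2-D DCT: C^T * coeffs * C (C is orthonormal)."""
    c = dct_matrix(len(coeffs))
    return matmul(matmul(transpose(c), coeffs), c)

# A smooth synthetic block: a gentle ramp in both directions.
block = [[100 + 4 * x + 2 * y for x in range(N)] for y in range(N)]
coeffs = dct2(block)

# Fraction of the total energy held by the 2x2 lowest-frequency coefficients.
total_energy = sum(c * c for row in coeffs for c in row)
low_energy = sum(coeffs[u][v] ** 2 for u in range(2) for v in range(2))
compaction = low_energy / total_energy

# Uniform quantization leaves only a few nonzero coefficients to entropy-code.
quantized = [[round(c / STEP) for c in row] for row in coeffs]
nonzero = sum(1 for row in quantized for q in row if q != 0)

# Without quantization the transform is invertible up to rounding error.
restored = idct2(coeffs)
max_err = max(abs(restored[y][x] - block[y][x])
              for y in range(N) for x in range(N))

print("energy fraction in 2x2 low-frequency corner:", compaction)
print("nonzero quantized coefficients out of", N * N, ":", nonzero)
print("max round-trip error without quantization:", max_err)
```

For a smooth block like this one, almost all of the energy lands in a handful of low-frequency coefficients, which is why quantization followed by entropy coding is effective after the transform.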
Recently, deep-learning-based compression methods have attracted a lot of attention due to their superior rate-distortion performance. Compared with traditional codecs, learning-based codecs have the following characteristics:
1. Inter-image correlation: Traditional image codecs only exploit correlation within the same image, while learning-based image codecs can also exploit correlation across other images (i.e., inter-image correlation).
2. Multi-scale representation: Traditional image codecs capture image structure only through variable block sizes, while learning-based image codecs can exploit a multi-scale representation obtained by pooling. In other words, traditional image codecs primarily exploit correlation at the block level, while learning-based image codecs can exploit short-, middle-, and long-range correlations through the multi-scale representation.
3. Advanced loss functions: Different loss functions can be easily designed in learning-based schemes to fit the human visual system (HVS), and attention mechanisms can be introduced conveniently.
To achieve low-complexity learning-based image coding, we propose a multi-grid multi-block-size vector quantization (MGBVQ) method that builds on these characteristics:
1. Input images are decomposed into representations of different resolutions through Lanczos interpolation. This yields a set of downsampled images together with the downsampling residual of each representation with respect to its neighboring representation.
2. With these representations available, we can easily capture correlations using VQ: long-range correlation is captured in the small (low-resolution) representations, and short-range correlation in the large (high-resolution) ones.
3. Components such as adaptive codebook selection are used to improve rate-distortion performance. For example, the codec can use a small number of codewords for a smooth or simple image and many codewords for a complex image.
4. Currently, we are working on context-guided sub-codebook selection, which uses the previously decoded representation to choose a suitable sub-codebook.
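The first two steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the MGBVQ implementation: 2x2 averaging and pixel replication stand in for Lanczos interpolation to keep the example dependency-free, the codebooks are trained with plain Lloyd (k-means) iterations, and the function names, block sizes, and codebook sizes are all hypothetical. Adaptive and context-guided codebook selection are not modeled.

```python
import math
import random

def downsample(img):
    """Halve resolution by 2x2 averaging (stand-in for Lanczos)."""
    return [[(img[y][x] + img[y][x + 1] + img[y + 1][x] + img[y + 1][x + 1]) / 4.0
             for x in range(0, len(img[0]), 2)] for y in range(0, len(img), 2)]

def upsample(img):
    """Double resolution by pixel replication (stand-in for Lanczos)."""
    out = []
    for row in img:
        wide = [v for v in row for _ in range(2)]
        out += [wide, list(wide)]
    return out

def combine(a, b, sign):
    """Element-wise a + sign * b for two equally sized grids."""
    return [[p + sign * q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]

def decompose(img, levels):
    """Return the coarsest grid and the residuals ordered coarse to fine."""
    residuals = []
    for _ in range(levels):
        coarse = downsample(img)
        residuals.append(combine(img, upsample(coarse), -1))
        img = coarse
    return img, residuals[::-1]

def to_blocks(img, b):
    """Flatten non-overlapping b x b blocks into vectors (raster order)."""
    return [[img[y + dy][x + dx] for dy in range(b) for dx in range(b)]
            for y in range(0, len(img), b) for x in range(0, len(img[0]), b)]

def dist2(u, v):
    return sum((p - q) ** 2 for p, q in zip(u, v))

def nearest(vec, codebook):
    """Index of the codeword closest to vec in squared error."""
    return min(range(len(codebook)), key=lambda k: dist2(vec, codebook[k]))

def train_codebook(vectors, size, iters=10, seed=0):
    """Plain Lloyd (k-means) training, initialised from random samples."""
    codebook = [list(v) for v in random.Random(seed).sample(vectors, size)]
    for _ in range(iters):
        cells = [[] for _ in codebook]
        for v in vectors:
            cells[nearest(v, codebook)].append(v)
        for k, cell in enumerate(cells):
            if cell:  # an empty cell keeps its previous codeword
                codebook[k] = [sum(c) / len(cell) for c in zip(*cell)]
    return codebook

# Synthetic 32x32 test image: smooth waves plus mild noise.
rng = random.Random(1)
image = [[128 + 40 * math.sin(x / 5) + 40 * math.cos(y / 7) + rng.uniform(-4, 4)
          for x in range(32)] for y in range(32)]

coarse, residuals = decompose(image, levels=2)   # grids: 8x8, 16x16, 32x32

# Without quantization the decomposition is invertible.
recon = coarse
for res in residuals:
    recon = combine(upsample(recon), res, +1)
lossless_err = max(abs(p - q) for ra, rb in zip(recon, image) for p, q in zip(ra, rb))

# VQ each residual grid; block and codebook sizes are illustrative only.
energies, distortions, indices = [], [], []
for res, (b, size) in zip(residuals, [(2, 8), (2, 16)]):
    vectors = to_blocks(res, b)
    codebook = train_codebook(vectors, size)
    idx = [nearest(v, codebook) for v in vectors]    # what would be transmitted
    indices.append(idx)
    energies.append(sum(dist2(v, [0.0] * len(v)) for v in vectors))
    distortions.append(sum(dist2(v, codebook[i]) for v, i in zip(vectors, idx)))

print("lossless max error:", lossless_err)
print("residual energy per grid:", energies)
print("VQ distortion per grid:  ", distortions)
```

In this sketch the codebook size passed to `train_codebook` is the knob that an adaptive selection scheme would tune per image: a smaller value lowers the index bit rate at the cost of higher distortion. The coarsest grid is left unquantized here purely for brevity.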