Computer Vision and Scene Analysis

MCL Research on Small Neural Network

Deep learning has shown great capability in many applications, and many architectures have been proposed to improve accuracy. However, such improvements may come at the cost of increased time and memory complexity, which matters for mobile and embedded applications. For these applications, small neural network design can help: small neural networks aim to reduce the network size while maintaining good performance. Examples include SqueezeNet [1], MobileNet [2], and ShuffleNet [3].

Despite the success of small neural networks, the reason why they can achieve good performance at a significantly reduced size has not been studied. In our research, we aim to quantitatively justify the design of small neural networks. In particular, we currently focus on the design of SqueezeNet [1]. SqueezeNet significantly reduces the number of network parameters while maintaining comparable performance by:

Replacing some of the 3×3 filters with 1×1 filters. Since each 3×3 filter has 9 weights per input channel while a 1×1 filter has only 1, using 1×1 filters in place of 3×3 filters greatly reduces the number of parameters.
Reducing the number of input channels to the 3×3 filters. This significantly reduces the number of parameters in the 3×3 filters.
Downsampling activation maps late in the network. This is motivated by the intuition that larger activation maps may improve accuracy.
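
The savings from the first two strategies can be checked with simple arithmetic. The sketch below counts the weights and biases of a convolution layer. The channel sizes (128 in, 128 out, a 16-channel squeeze) are hypothetical values chosen for illustration, not figures taken from SqueezeNet, and `conv_params` is a helper name introduced here.

```python
def conv_params(in_channels, out_channels, kernel_size, bias=True):
    """Count learnable parameters in a 2-D convolution layer."""
    weights = in_channels * out_channels * kernel_size * kernel_size
    return weights + (out_channels if bias else 0)


# Hypothetical layer sizes, for illustration only.
C_IN, C_OUT, C_SQUEEZE = 128, 128, 16

# Strategy 1: a 1x1 layer needs about 9x fewer weights than a 3x3 layer.
plain_3x3 = conv_params(C_IN, C_OUT, 3)
plain_1x1 = conv_params(C_IN, C_OUT, 1)

# Strategy 2: squeeze the channels with 1x1 filters first, so the
# 3x3 filters that follow see only C_SQUEEZE input channels.
squeeze = conv_params(C_IN, C_SQUEEZE, 1)
expand_3x3 = conv_params(C_SQUEEZE, C_OUT, 3)
squeezed_total = squeeze + expand_3x3

print(plain_3x3, plain_1x1, squeezed_total)
```

With these assumed sizes, the squeezed pair of layers uses roughly one seventh of the parameters of the plain 3×3 layer.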

A key module of SqueezeNet is the Fire module. A Fire module consists of a squeeze layer and a subsequent expand layer. The squeeze layer reduces the number of input channels to the 3×3 filters in the expand layer. In our work, we use some metrics and visualization techniques to analyze the role of [...]

May 3rd, 2020 | Computer Vision and Scene Analysis, News, Research

MCL Research on Source-Distribution-Aimed Generative Model

There are typically two types of statistical models in machine learning: discriminative models and generative models. Unlike discriminative models, which aim at drawing decision boundaries, generative models aim at modeling the data distribution in the whole space. Generative models tackle a more difficult task than discriminative models because they need to model complicated distributions. For example, a generative model should capture correlations such as “things that look like boats are likely to appear near things that look like water”, while a discriminative model only differentiates “boat” from “not boat”.

Image generative models have become popular in recent years since Generative Adversarial Networks (GANs) can generate realistic natural images. GANs, however, have no clear relationship to probability distributions, and they suffer from a difficult training process and the mode dropping problem. Although these two problems may be alleviated by using different loss functions [1], the underlying relationship to probability distributions remains vague in GANs. This encourages us to develop a SOurce-Distribution-Aimed (SODA) generative model that aims at providing clear probability distribution functions to describe the data distribution.

There are two main modules in our SODA generative model. One finds proper source data representations, and the other determines the source data distribution in each representation. One proper representation for source data is the joint spatial-spectral representation proposed by Kuo [2, 3]. By transforming between the spectral domain and the spatial domain, a rich set of spectral and spatial representations can be obtained. Spectral representations are vectors of Saab coefficients, while spatial representations are pixels in an image or Saab coefficients arranged according to their pixel order in the spatial domain. The spectral representation at the last stage gives a global view of an image, while the spatial representations describe details in [...]

April 27th, 2020 | Computer Vision and Scene Analysis, News, Research

MCL Research on Image Super-resolution

Image super-resolution (SR) is a classic problem in computer vision (CV) that aims at recovering a high-resolution image from a low-resolution image. As a type of supervised generative problem, image SR attracts wide attention due to its strong connection with other CV topics, such as object recognition, object alignment, and texture synthesis. It also has extensive real-world applications, for example, medical diagnosis, remote sensing, and biometric identification.

State-of-the-art SR approaches typically fall into two mainstreams: 1) example-based learning methods, and 2) deep learning (CNN-based) methods. Example-based methods either exploit external low-/high-resolution exemplar pairs [1] or learn the internal similarity of the same image across different resolution scales [2]. To tackle model overfitting and improve generalization, dictionary strategies such as sparse coding (SC) are normally applied for encoding. However, the features used in example-based methods are usually traditional gradient-related or otherwise handcrafted ones, which may affect model efficiency. In contrast, CNN-based SR methods (e.g. SRCNN [3]) do not really distinguish between feature extraction and decision making. Many basic CNN models and blocks, e.g. GANs, residual learning, and attention networks, have been applied to the SR problem and provide superior SR results. Nevertheless, the non-explainable process and the heavy training cost are serious drawbacks of CNN-based methods.

By taking advantage of reasonable feature extraction [4], we use spatial-spectral compatible features to express exemplar pairs. In addition, we formulate a Successive-Subspace-Learning-based (SSL-based) method that partitions data into subspaces by feature statistics and applies regression in each subspace for better local approximation. Some adaptation is also applied for better data fitting. In the future, we aim to provide an explainable and highly efficient SSL-based method for the SR problem.
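
The idea of partitioning the data and fitting one regressor per partition can be illustrated on toy data. The sketch below is not the SSL-based method itself: the 1-D feature, the median split, and the straight-line regressor are all assumptions made for illustration, and `fit_line`, `fit_piecewise`, and `predict` are hypothetical helper names.

```python
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b for 1-D inputs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def fit_piecewise(xs, ys, split):
    """Partition samples at `split` and fit one line per partition."""
    left = [(x, y) for x, y in zip(xs, ys) if x < split]
    right = [(x, y) for x, y in zip(xs, ys) if x >= split]
    return fit_line(*zip(*left)), fit_line(*zip(*right))


def predict(model, split, x):
    """Evaluate the regressor of the partition that x falls into."""
    a, b = model[0] if x < split else model[1]
    return a * x + b


# Toy data following y = |x|, which no single straight line fits well.
xs = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]
ys = [abs(x) for x in xs]

split = statistics.median(xs)           # partition by a feature statistic
global_model = fit_line(xs, ys)         # one regressor for all samples
local_model = fit_piecewise(xs, ys, split)

print(global_model, local_model, predict(local_model, split, -2.5))
```

On this toy data the single global line is flat, while the two local lines recover both branches of the curve, which is the sense in which per-partition regression gives a better local approximation.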

— By Wei Wang



[1] Timofte, Radu, Vincent De Smet, and [...]

April 6th, 2020 | Computer Vision and Scene Analysis, News, Research

Image characterization and categorization based on learning of visual attention

Authors: Jia He, Xiang Fu, Shangwen Li, Chang-Su Kim, and C.-C. Jay Kuo

The visual attention of an image, also known as its saliency, is defined as the regions and contents of the image that attract the human eye, such as regions with high contrast, bright luminance, vivid color, or clear scene structure, or the semantic objects that humans expect to see. Our research aims to learn visual attention from an image database and then develop image characterization and classification algorithms based on the learned visual attention features. These algorithms will be applied to image compression, retargeting, annotation, segmentation, retrieval, etc.

Image saliency has been widely studied recently. However, most work focuses on extracting the saliency map of an image using a bottom-up context computation framework [1-5]. The saliency of an image does not always exactly match human visual attention, since humans tend to “be attracted” by things of particular interest to them. To bridge this gap, the learning of visual attention should combine both bottom-up and top-down frameworks. To achieve this goal, we are building a hierarchical human perception tree and learning image visual attention with detailed image characteristics, including the salient region’s appearance, semantics, attention priority, and intensity. Image classification will then be based on the content of the salient area and its saliency intensity. Our system will not only capture the locations of visual attention regions in an image but also estimate their priorities and intensities.

Building a hierarchical human perceptual tree for visual attention learning will be challenging because of its complexity, and little work has been done on such modeling. We aim to model the perceptual tree as closely as possible [...]

November 21st, 2013 | Computer Vision and Scene Analysis

Hierarchical Bag-of-Words Model for Joint Multi-View Object Representation and Classification

Authors: Xiang Fu, Sanjay Purushotham, Daru Xu, and C.-C. Jay Kuo

The rapid development of video sharing over the Internet creates a large number of videos every day. Organizing or classifying this flood of Internet images and videos automatically online is an essential task, and it will greatly help the search for useful videos in the future, especially in applications such as video surveillance and image/video retrieval.

One classic way for humans to categorize videos is based on their content, which makes content-based video analysis a hot topic. An object is undoubtedly the most significant component for representing video content, so object recognition and classification play a significant role in intelligent information processing.

The traditional tasks of object recognition and classification include two parts. One is to identify a particular object in an image from an unknown viewpoint, given a few views of that object for training; this is called “multi-view specific object recognition”. Later on, researchers attempted to capture the internal relations of object classes from one specific view, which developed into another task called “single-view object classification”. In this case, the diversity of an object class in appearance, shape, or color should be taken into consideration, and these variations increase the difficulty of classification. Over the last decade, many researchers have addressed these two tasks using a concept called intra-class similarity. To further reduce the semantic gap between machine and human, the problem of “multi-view object classification” needs to be well studied.

As shown in Fig. 1, three elements define a view: angle, scale (distance), and height, which together form the view sphere. Although viewpoint and intra-class variations exist, as illustrated in Fig. 2, some common features can still be found for one object class by [...]

November 21st, 2013 | Computer Vision and Scene Analysis

Automatic Image Annotation

Authors: Shangwen Li and C.-C. Jay Kuo

Text-based information retrieval techniques have achieved significant progress over the last decades, resulting in huge search engine companies like Google. However, image-based retrieval is still an open field with no perfect solution. Currently, content-based image retrieval methods attempt to extract low-level features (including shape, color, texture, etc.) and search for related images based on feature similarity. This is rather unreliable, since one object can look different under different scenarios. Another way to handle the problem is to first annotate the image with its key concepts and then use text-based search methods to retrieve relevant information. However, manually labeling images is tremendously time-consuming. Consequently, automatic annotation becomes a potential way of solving the image retrieval problem.

Currently, I am still searching for a good solution to the automatic image annotation problem. As shown in Fig. 1, the biggest challenge in image annotation is how to link low-level features and high-level linguistic concepts together. The current literature offers no satisfactory solution: typically, the F-measure of the proposed algorithms is lower than 0.5. My approach first annotates the image with some metadata, such as human/non-human, indoor/outdoor, or visually salient or not. By first categorizing the image into such coarse classes, we can then apply a different method to each class accordingly.
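
For reference, the F-measure mentioned above combines precision and recall into a single score. The sketch below assumes the balanced F1 variant computed on one image's tag set; the post does not say which variant or averaging scheme the cited algorithms report, and the tags and the `f_measure` helper are hypothetical.

```python
def f_measure(predicted, reference):
    """Balanced F-measure (F1) between predicted and reference tag sets."""
    predicted, reference = set(predicted), set(reference)
    true_pos = len(predicted & reference)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)   # correct tags / predicted tags
    recall = true_pos / len(reference)      # correct tags / reference tags
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotation of a single image.
predicted_tags = ["sky", "tree", "car", "dog"]
reference_tags = ["sky", "tree", "person"]

score = f_measure(predicted_tags, reference_tags)
print(round(score, 3))
```

Here two of the four predicted tags are correct and two of the three reference tags are found, so precision is 1/2 and recall is 2/3.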

Currently, many image annotation algorithms use probabilistic topic models to link features and concepts [1][2][3][4]. Other methods use the KNN method to solve the problem [5]. However, none of them tries to divide the [...]

November 21st, 2013 | Computer Vision and Scene Analysis

TEAM: Ensemble Classifier for Large-Scale Indoor/Outdoor Images

Authors: Chen Chen and C.-C. Jay Kuo

An ensemble classifier, called TEAM (The Experts Assembling Machine), is proposed in this work for the classification of large-scale indoor/outdoor images. Instead of applying a single classification method to a large number of extracted image features, TEAM integrates the decisions made by a set of individual classifiers, each of which is called an expert. Although the classification performance of an expert is reasonable for small datasets, it degrades as the dataset size increases. TEAM offers robust and accurate classification performance for a large-scale indoor/outdoor image dataset since it allows experts to compensate for each other’s weaknesses in the face of diversified image data. We conduct experiments on an image dataset containing around 100,000 images and show that TEAM can improve the classification accuracy of individual experts by a margin ranging from 6% to 20%.

We set a benchmark for indoor/outdoor classification algorithms by releasing a dataset of around 100,000 indoor/outdoor images. Image examples can be seen in Fig. 1. We also implemented and released 7 individual experts in order to benchmark the strength of TEAM. Related source code can be found in the links below.

Fig. 1

The proposed TEAM stands on the shoulders of state-of-the-art indoor/outdoor image classification methods. Unlike other ensemble learning algorithms (such as AdaBoost and Random Forests), TEAM learns the complementarity between different algorithms at the decision level. The decisions of the different methods are obtained from different low-level features and different classification methods. As a result, TEAM performs more robustly as the data size increases. In addition, TEAM can serve as a framework for all scene classification problems (not restricted to indoor/outdoor image classification).
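
Combining experts at the decision level can be illustrated with the simplest possible fusion rule, a majority vote. This is only a stand-in: TEAM learns how to combine its experts, and that learned rule is not described in this excerpt. The expert names, their decisions, and the ground truth below are hypothetical.

```python
from collections import Counter


def majority_vote(decisions):
    """Return the label chosen by the largest number of experts."""
    return Counter(decisions).most_common(1)[0][0]


def accuracy(predicted, truth):
    """Fraction of images labeled correctly."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


# Hypothetical decisions of three experts on four images.
expert_outputs = {
    "expert_a": ["indoor", "outdoor", "outdoor", "outdoor"],
    "expert_b": ["indoor", "indoor", "indoor", "outdoor"],
    "expert_c": ["outdoor", "outdoor", "indoor", "indoor"],
}
truth = ["indoor", "outdoor", "indoor", "outdoor"]

# Fuse the experts image by image; with three experts and two labels
# a tie cannot occur.
fused = [majority_vote(votes) for votes in zip(*expert_outputs.values())]

for name, output in expert_outputs.items():
    print(name, accuracy(output, truth))
print("fused", accuracy(fused, truth))
```

In this toy case each expert errs on different images, so the fused decision is correct even where an individual expert is wrong, which is the kind of compensation the ensemble relies on.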

There are two directions for future work.

Currently, we only focus on low-level feature experts. We [...]

November 21st, 2013 | Computer Vision and Scene Analysis

3-D Object Classification & Retrieval

Authors: Xiaqing Pan and C.-C. Jay Kuo

The 3-D object classification and retrieval problem can be stated as identifying the correct class of a query object or retrieving objects relevant to it. In past years, researchers have developed several useful signatures to describe 3-D objects in a compact way, focusing on global description, graph description, and local description. However, most of these signatures cannot handle a generic database well because of their limited ability to differentiate 3-D objects with different shapes, poses, and surface properties. My research aims to develop a sophisticated signature and a complete classification and retrieval scheme that achieve high retrieval performance and classification accuracy on a generic database.

Global signatures such as [1] [2] [3] try to capture the global shape of a 3-D object but lose the details. Local signatures such as [4] start from local salient points and then build up a statistical signature for the entire mesh, but they are not robust under large shape variance. Graph signatures extract and analyze the topological information of a mesh but are effective only in limited cases. Our idea is to overcome the limitations of previous research and design a natural description for 3-D objects that complies closely with human perception.

The following challenges will be addressed in the future:

Ability to differentiate and group objects with different and similar shapes
Robustness under large pose changes
Adaptiveness to variance in surface properties

[1] Osada R, Funkhouser T, Chazelle B, Dobkin D (2002) Shape distributions. ACM Trans Graph 21(4):807–832
[2] Shen Y-T, Chen D-Y, Tian X-P, Ouhyoung M (2003) 3D model search engine based on lightfield descriptors. In: Proc. Eurographics 2003
[3] Kazhdan M, Funkhouser T, Rusinkiewicz S (2003) Rotation invariant spherical [...]

November 21st, 2013 | Computer Vision and Scene Analysis