USC Media Communications Lab

MCL Research on Unsupervised Video Segmentation

We propose a method for unsupervised video object segmentation by transferring the knowledge encapsulated in image-based instance embedding networks. The instance embedding network produces an embedding vector for each pixel, which makes it possible to identify all pixels belonging to the same object. Though trained on static images, the instance embeddings are stable over consecutive video frames, which allows us to link objects together over time. Thus, we adapt the instance networks trained on static images to video object segmentation and combine the embeddings with objectness and optical flow features, without model retraining or online fine-tuning. The proposed method outperforms state-of-the-art unsupervised segmentation methods on the DAVIS dataset and the FBMS dataset.

The main contributions include:

– A new strategy for adapting instance segmentation models trained on static images to videos. Notably, this strategy performs well on video datasets without requiring any video object segmentation annotations.

– Proposal of novel criteria for selecting a foreground object without supervision, based on semantic score and motion features over a track.

– Insights into the stability of instance segmentation embeddings over time.
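To make the idea of linking objects over time more concrete, here is a minimal sketch in Python. It is not the method from the post: it ignores objectness and optical flow entirely, and the function names, the greedy nearest-neighbour matching, and the `max_distance` threshold are all assumptions made for illustration. It only shows how stable per-pixel embeddings could let an object in one frame be matched to an object in the next.

```python
import math


def mean_embedding(embeddings):
    """Average a list of per-pixel embedding vectors into one vector."""
    dim = len(embeddings[0])
    count = len(embeddings)
    return [sum(vec[d] for vec in embeddings) / count for d in range(dim)]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def link_objects(prev_objects, curr_objects, max_distance=0.5):
    """Greedily link each current-frame object to the closest previous one.

    prev_objects / curr_objects map an object id to the list of embedding
    vectors of its pixels. Returns a dict mapping each current id to the
    matched previous id, or None when no previous object is close enough.
    max_distance is a hypothetical threshold, not a value from the post.
    """
    prev_means = {oid: mean_embedding(px) for oid, px in prev_objects.items()}
    links = {}
    for curr_id, pixels in curr_objects.items():
        curr_mean = mean_embedding(pixels)
        best_id, best_dist = None, max_distance
        for prev_id, prev_mean in prev_means.items():
            dist = euclidean(curr_mean, prev_mean)
            if dist < best_dist:
                best_id, best_dist = prev_id, dist
        links[curr_id] = best_id
    return links


# Toy example with 2-D embeddings (real embeddings have more dimensions).
frame_t = {"a": [[0.0, 0.0], [0.1, 0.0]], "b": [[1.0, 1.0], [1.0, 0.9]]}
frame_t1 = {"x": [[0.05, 0.02]], "y": [[0.95, 1.0]], "z": [[5.0, 5.0]]}
links = link_objects(frame_t, frame_t1)
```

In this toy example, `x` and `y` are linked to `a` and `b`, while `z` stays unlinked because no previous object lies within the threshold.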

By Siyang Li

By Xuejing Lei | February 12th, 2018 | News, Research

Image characterization and categorization based on learning of visual attention

Author: Jia He, Xiang Fu, Shangwen Li, Chang-Su Kim, and C.-C. Jay Kuo

The visual attention of an image, also known as its saliency, is defined as the regions and contents of the image that attract human attention. These may be regions with high contrast, bright luminance, vivid color, or clear scene structure, or they may be semantic objects that humans expect to see. Our research aims to learn visual attention from an image database and then develop image characterization and classification algorithms based on the learned visual attention features. These algorithms will be applied to image compression, retargeting, annotation, segmentation, retrieval, etc.

Image saliency has been widely studied recently. However, most work focuses on extracting the saliency map of an image using a bottom-up context computation framework [1-5]. The saliency of an image does not always match human visual attention exactly, since humans tend to be attracted by things that match their particular interests. To bridge this gap, the learning of visual attention should combine bottom-up and top-down frameworks. To achieve this goal, we are building a hierarchical human perception tree and learning the visual attention of images with detailed image characteristics, including the salient region’s appearance, semantics, attention priority, and intensity. Image classification will then be based on the content of the salient area and its saliency intensity. Our system will not only capture the locations of visual attention regions in an image but also estimate their priorities and intensities.

Building a hierarchical human perceptual tree for visual attention learning will be challenging because of its complexity, and little work has been done on this kind of modeling. We aim to model the perceptual tree as closely as possible [...]

By Jia He | November 21st, 2013 | Computer Vision and Scene Analysis

Hierarchical Bag-of-Words Model for Joint Multi-View Object Representation and Classification

Author: Xiang Fu, Sanjay Purushotham, Daru Xu, and C.-C. Jay Kuo

The rapid growth of video sharing over the Internet produces a large number of videos every day. Organizing and classifying this volume of Internet images and videos automatically is an essential task. It will make searching for useful videos much easier, especially in applications such as video surveillance and image/video retrieval.

One classic way for humans to categorize videos is based on their content, which makes content-based video analysis a hot topic. Objects are undoubtedly the most significant component in representing video content, so object recognition and classification play a significant role in intelligent information processing.

The traditional tasks of object recognition and classification fall into two parts. One is to identify a particular object in an image taken from an unknown viewpoint, given a few views of that object for training, which is called “multi-view specific object recognition”. Later, researchers attempted to capture the internal relations of object classes from one specific view, which developed into another task called “single-view object classification”. In this case, the diversity of an object class in appearance, shape, or color must be taken into consideration, and these variations make classification more difficult. Over the last decade, many researchers have addressed these two tasks using a concept called intra-class similarity. To further reduce the semantic gap between machine and human, the problem of “multi-view object classification” needs to be well studied.

As shown in Fig. 1, three elements define a view: angle, scale (distance), and height, which together form the view sphere. Although viewpoint and intra-class variations exist, as illustrated in Fig. 2, some common features can still be found for one object class by [...]

By Xiang Fu | November 21st, 2013 | Computer Vision and Scene Analysis

Automatic Image Annotation

Author: Shangwen Li and C.-C. Jay Kuo

Text-based information retrieval techniques have achieved significant progress over the last decades, giving rise to huge search engine companies like Google. However, image-based retrieval is still an open problem with no perfect solution. Current content-based image retrieval methods extract low-level features (including shape, color, texture, etc.) and search for related images based on feature similarity. However, this is rather unreliable, since one object can look different under different scenarios. Another way to handle the problem is to first annotate each image with its key concepts and then use text-based search methods to retrieve relevant information. However, manually labeling images is tremendously time-consuming. Consequently, automatic annotation becomes a promising way of solving the image retrieval problem.

Currently, I am still searching for a good solution to the automatic image annotation problem. As shown in Fig. 1, the biggest challenge in image annotation is how to link low-level features and high-level linguistic concepts. The current literature offers no satisfactory solution: the F-measure of the proposed algorithms is typically lower than 0.5. My approach is to first annotate the image with some metadata, such as human/non-human, indoor/outdoor, visually salient or not, etc. By first categorizing images into coarse classes, we can then apply a different method to each class.

Currently, many image annotation algorithms use probabilistic topic models to link features and concepts [1][2][3][4]. Other methods use k-nearest neighbors (KNN) to solve the problem [5]. However, none of them are trying to divide the [...]

By Shangwen Li | November 21st, 2013 | Computer Vision and Scene Analysis

TEAM: Ensemble Classifier for Large-Scale Indoor/Outdoor Images

Author: Chen Chen and C.-C. Jay Kuo

An ensemble classifier, called TEAM (The Experts Assembling Machine), is proposed in this work for the classification of large-scale indoor/outdoor images. Instead of applying a single classification method to a large number of extracted image features, TEAM integrates the decisions made by a set of individual classifiers, each of which is called an expert. Although the classification performance of an expert is reasonable for small datasets, it degrades as the dataset size increases. TEAM offers robust and accurate classification performance on a large-scale indoor/outdoor image dataset, since it allows experts to compensate for each other’s weaknesses in the face of diversified image data. We conduct experiments on an image dataset containing around 100,000 images and show that TEAM can improve the classification accuracy of individual experts by a margin ranging from 6% to 20%.

We set a benchmark for indoor/outdoor classification algorithms by releasing a dataset with around 100,000 indoor/outdoor images. Image examples can be seen in Fig. 1. We also implemented and released 7 individual experts in order to benchmark the strength of TEAM. Related source code can be found in the links below.

Fig. 1

The proposed TEAM stands on the shoulders of state-of-the-art indoor/outdoor image classification methods. Unlike other ensemble learning algorithms (such as AdaBoost and Random Forests), TEAM learns the complementarity between different algorithms at the decision level. The decisions of the individual methods are obtained with different low-level features and different classification methods. As a result, TEAM performs more robustly as the data size increases. In addition, TEAM can serve as a framework for all scene classification problems (not restricted to indoor/outdoor image classification).
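As a rough illustration of what fusing experts at the decision level can look like, the Python sketch below combines binary indoor/outdoor votes with a weighted majority vote. The post does not describe how TEAM actually learns its fusion, so the log-odds weighting, the function names, and the toy data are all assumptions; this is a generic textbook scheme, not the TEAM algorithm.

```python
import math


def expert_weights(decisions, labels, eps=1e-6):
    """Weight each expert by the log-odds of its validation accuracy.

    decisions holds one list of 0/1 predictions per expert, and labels
    holds the ground-truth 0/1 labels (say 0 = indoor, 1 = outdoor).
    The log-odds weighting is an assumed, generic choice.
    """
    weights = []
    for preds in decisions:
        correct = sum(p == y for p, y in zip(preds, labels))
        acc = correct / len(labels)
        acc = min(max(acc, eps), 1.0 - eps)  # keep the logarithm finite
        weights.append(math.log(acc / (1.0 - acc)))
    return weights


def fuse(expert_votes, weights):
    """Weighted majority vote over one 0/1 vote per expert for one image."""
    score = sum(w if v == 1 else -w for v, w in zip(expert_votes, weights))
    return 1 if score > 0 else 0


# Toy validation set: three hypothetical experts, six images.
labels = [1, 1, 0, 0, 1, 0]
decisions = [
    [1, 1, 0, 0, 1, 1],  # expert A: 5 of 6 correct
    [1, 0, 0, 1, 1, 0],  # expert B: 4 of 6 correct
    [0, 1, 1, 0, 1, 0],  # expert C: 4 of 6 correct
]
weights = expert_weights(decisions, labels)
fused = fuse([1, 0, 0], weights)  # expert A alone outvotes B and C here
```

The point of the sketch is only that the fusion operates on decisions rather than on raw features, so each expert is free to use its own features and classifier.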

There are two directions for future work.

Currently, we only focus on low-level feature experts. We [...]

By Chen Chen | November 21st, 2013 | Computer Vision and Scene Analysis

Age Group Classification via Structured Fusion of Uncertainty-driven Shape Features and Selected Surface Features

Author: Kuan-Hsien Liu, Shuicheng Yan, and C.-C. Jay Kuo

Facial image processing has attracted a lot of attention in the computer vision community over the last two decades. The human face can reveal important perceptual characteristics such as identity, gender, race, emotion, pose, and age. Among these characteristics, age is of particular importance. The aging process is complicated, irreversible, and uncontrollable. It is affected by various factors, including living environment, climate, health, lifestyle, and biological reasons. Age-related facial image processing is being extensively studied, and facial age group classification is one of the major research topics in this area.

Applications include age-based facial image retrieval, internet access control, security control and surveillance, biometrics, age-based human-computer interaction (HCI), age prediction for finding missing children, and age estimation based on the result of age group classification. Age estimation can be done more accurately if it works on groups that cover a narrower age range. Hence, the age group classification problem is an interesting one that demands further effort.

We present a structured fusion method for facial age group classification, as shown in Figure 1. To make use of the structured fusion of shape features and surface features, we introduce the region of certainty (ROC), which not only controls the classification accuracy of the shape-feature-based system but also reduces the classification load on the surface-feature-based system. In the first stage, we design two shape features that can be used to classify frontal faces with high accuracy. In the second stage, a surface feature is adopted and then selected by a statistical method. The statistically selected surface features, combined with an SVM classifier, offer high classification rates. With properly adjusting the ROC by a single non-sensitive [...]

By Kuan-Hsien Liu | November 21st, 2013 | Biometrics

3-D Object Classification & Retrieval

Author: Xiaqing Pan and C.-C. Jay Kuo

The 3-D object classification and retrieval problem can be stated as identifying the correct class of a query object or retrieving objects relevant to it. In past years, researchers have developed several useful signatures to describe 3-D objects compactly, focusing on global, graph, and local descriptions. However, most of these signatures cannot handle a generic database well because of their limited ability to differentiate 3-D objects with different shapes, poses, and surface properties. My research aims to develop a sophisticated signature and a complete classification and retrieval scheme that achieve high retrieval performance and classification accuracy on a generic database.

Global signatures such as [1] [2] [3] try to capture the global shape basis of a 3-D object but lose the details. Local signatures such as [4] start from local salient points and then build up a statistical signature for an entire mesh, but they are not robust under large shape variance. Graph signatures extract the topological information from a mesh and analyze it, but they are effective only in limited cases. Our idea is to overcome the limitations of previous research and design a natural description for 3-D objects that closely matches human perception.

The following challenges will be addressed in future work:

Ability to differentiate and group objects with different and similar shapes
Robustness under large pose changes
Adaptiveness to variance in surface properties

[1] Osada R, Funkhouser T, Chazelle B, Dobkin D (2002) Shape distributions. ACM Trans Graph 21(4):807–832
[2] Shen Y-T, Chen D-Y, Tian X-P, Ouhyoung M (2003) 3D model search engine based on lightfield descriptors. In: Proc. Eurographics 2003
[3] Kazhdan M, Funkhouser T, Rusinkiewicz S (2003) Rotation invariant spherical [...]

By Xiaqing Pan | November 21st, 2013 | Computer Vision and Scene Analysis

Advanced techniques for text detection in compound images

Author: Harshad Kadu, Jian Li, and C.-C. Jay Kuo

Detecting text regions in natural images is an important task for many computer vision applications, such as compound video compression, optical character recognition, reading text for visually impaired users, and robotic navigation. We are trying to solve this text localization problem, also known as the compound image segmentation problem. Unlike text in scanned documents, text in natural images may vary in size, font, orientation, color, and foreground or background illumination. The cluttered background in natural images may also seriously degrade the accuracy of text localization algorithms. Compound image segmentation is therefore an inherently difficult problem.

In our research, we propose a novel text localization scheme based on the fusion of diverse local operators, such as the morphological detector, the maximally stable extremal regions (MSER) blob analyzer [3, 4], the distance transform, and the stroke-width transform [2]. These operators examine different characteristics of text to discover regions with possible textual content. An ensemble of trained SVM classifiers categorizes these regions as text or non-text using local feature information. Finally, a fragment grouping mechanism merges the text candidates and carves out individual words. Refer to the figure below.

Our proposed fusion technique uses a novel three-tier framework to systematically separate out the individual words in an image. Text regions have some peculiar properties that distinguish them from non-text regions. To gain insight, we explore these properties using our novel morphological text detector. Apart from this operator, we also use existing detectors, such as MSER [3, 4] and the stroke-width transform [2], to improve detection accuracy. We hope to achieve significantly enhanced performance with this fusion framework.
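The fragment grouping step mentioned above can be pictured with a small Python sketch. The post gives no details of the actual mechanism, so everything here is an assumption made for illustration: the `(x0, y0, x1, y1)` box format, the `gap_ratio` threshold, and the simplification that all fragments lie on a single text line. The sketch merges neighbouring text fragments into words when the horizontal gap between them is small relative to their height.

```python
def vertical_overlap(a, b):
    """Height of the vertical overlap between two (x0, y0, x1, y1) boxes."""
    top = max(a[1], b[1])
    bottom = min(a[3], b[3])
    return max(0, bottom - top)


def group_fragments(boxes, gap_ratio=0.5):
    """Merge text fragments on one line into word-level boxes.

    Fragments are visited left to right. A fragment joins the current
    word when it overlaps it vertically and the horizontal gap is at most
    gap_ratio times the smaller box height. gap_ratio is hypothetical.
    """
    words = []
    for box in sorted(boxes):
        if words:
            last = words[-1]
            height = min(last[3] - last[1], box[3] - box[1])
            gap = box[0] - last[2]
            if gap <= gap_ratio * height and vertical_overlap(last, box) > 0:
                # Grow the current word box to cover the new fragment.
                words[-1] = (
                    min(last[0], box[0]),
                    min(last[1], box[1]),
                    max(last[2], box[2]),
                    max(last[3], box[3]),
                )
                continue
        words.append(tuple(box))
    return words


# Four character-like fragments that form two words on one line.
fragments = [(0, 0, 10, 20), (12, 0, 22, 20), (60, 0, 70, 20), (72, 1, 80, 19)]
words = group_fragments(fragments)
```

Here the first two fragments merge into one word box and the last two into another, because the gap between the second and third fragments is much larger than the gap within each pair.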

The future [...]

By Harshad Kadu | November 21st, 2013 | Biometrics

Facial Recognition in Heterogeneous Environment

Author: Chun-Ting Huang and C.-C. Jay Kuo

“Facial recognition” has become an important technique for meeting the tremendously growing need for identification and verification since the last century. The replacement of traditional transactions by electronic transactions has drawn attention to facial recognition from the research and business communities, because facial recognition requires no physical interaction on the part of users. Research on facial recognition can be traced back to the early 1990s, to the Eigenface method proposed by Turk and Pentland in 1991 [1], which has over 11409 citations on Google Scholar. The follow-up development can be summarized into the general directions discussed in the Face Recognition Vendor Test (FRVT) 2002 [2], and different face databases have been developed to cover various conditions, such as pose, expression, and environment. A new database called the Long Distance Heterogeneous Face Database (LDHF-DB) [3] focuses on face images captured at various distances and with a near-infrared camera, which poses a new challenge in this field.

At long distances, the near-infrared camera can only capture blurred and vague face images, as shown in Fig. 1, which causes template features to perform poorly on LDHF-DB. Therefore, our research adopts only geometric and shape-based features, both local and global, as the input of the structured-fusion method. Based on the different characteristics of the features we collected from the database, we aim to develop a robust classification algorithm with machine learning to distinguish faces of various quality.

The major difference between our work and other research lies in the feature selection and the structured fusion model. As explained above, template methods achieve only fair performance in a heterogeneous environment. Our proposed model can boost the recognition rate by exploiting the strengths of different features and discarding the outliers for particular [...]

By Chun-Ting Huang | November 21st, 2013 | Biometrics

Video Coding/Screen Video Coding with Quality Assessment

Author: Sudeng Hu and C.-C. Jay Kuo

Due to rapidly growing video applications in areas such as wireless display and cloud computing, screen content coding has received much interest from academia and industry in recent years. The High Efficiency Video Coding (HEVC) standard has achieved significant improvement in coding efficiency compared with the state-of-the-art H.264/AVC standard. However, HEVC has been designed mainly for natural video captured by cameras. Screen content images and video, also known as compound images, hybrid images, and mixed-raster content material, typically contain computer-generated content such as text and graphics, sometimes in combination with natural or camera-captured material. Since the properties of screen content are quite different from those of natural content, and HEVC currently does not exploit these properties, there is still room for improvement in coding efficiency.

For screen content, it is our observation that directly encoding residual signals in the spatial domain may not be efficient enough. This is because, except for the edge, the remaining areas are still smooth and can be coded more effectively with a transform. In this paper, we propose a new scheme, called Edge Mode (EM), to encode these kinds of blocks. Based on the intra prediction direction, six possible edge positions inside a block are defined, and one of them is selected via rate-distortion (RD) optimization. To reduce the encoding complexity, the proposed scheme can be further simplified by classifying intra modes into four categories. Then, M×N 2D DCTs or non-orthogonal 2D transforms are performed separately in sub-blocks. Finally, the new edge mode is integrated into HEVC to produce a more powerful coding scheme.
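The intuition behind transforming sub-blocks separately can be checked with a small Python sketch. It is not the Edge Mode implementation: it uses a single hypothetical vertical edge position instead of the six positions tied to the intra prediction direction, it skips RD optimization and HEVC integration, and the helper names are invented for illustration. It only compares how many non-zero DCT coefficients a block with a sharp edge needs when it is transformed as a whole versus as two sub-blocks split at the edge.

```python
import math


def dct_2d(block):
    """Orthonormal 2D DCT-II of an M x N block given as a list of rows."""
    m, n = len(block), len(block[0])

    def alpha(k, size):
        return math.sqrt(1.0 / size) if k == 0 else math.sqrt(2.0 / size)

    out = [[0.0] * n for _ in range(m)]
    for u in range(m):
        for v in range(n):
            total = 0.0
            for i in range(m):
                for j in range(n):
                    total += (
                        block[i][j]
                        * math.cos(math.pi * (2 * i + 1) * u / (2 * m))
                        * math.cos(math.pi * (2 * j + 1) * v / (2 * n))
                    )
            out[u][v] = alpha(u, m) * alpha(v, n) * total
    return out


def split_at_column(block, edge_col):
    """Split a block into left and right sub-blocks at a vertical edge."""
    left = [row[:edge_col] for row in block]
    right = [row[edge_col:] for row in block]
    return left, right


def nonzero_count(coeffs, tol=1e-6):
    """Count coefficients whose magnitude exceeds a small tolerance."""
    return sum(abs(c) > tol for row in coeffs for c in row)


# A 4x4 block that is flat except for a sharp vertical edge at column 3.
block = [[10, 10, 10, 200] for _ in range(4)]
whole = nonzero_count(dct_2d(block))
left, right = split_at_column(block, 3)
parts = nonzero_count(dct_2d(left)) + nonzero_count(dct_2d(right))
```

For this toy block, the whole-block transform spreads the edge over several coefficients, while each flat sub-block collapses to a single DC coefficient, so `parts` is smaller than `whole`.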

[1] Sudeng Hu, Lei Deng, and C.-C. Jay Kuo “A New Distortion/Content-Dependent Video Quality Index (DCVQI),” [...]

By Sudeng Hu | November 21st, 2013 | Visual Quality and Perceptual Coding
