News

MCL Research on 3D Perception with Large Foundational Models

Understanding and retrieving information in 3D scenes poses a significant challenge in artificial intelligence (AI) and machine learning (ML), particularly in grasping complex spatial relationships and detailed object properties in 3D spaces. While large foundation models such as CLIP [1] have made impressive progress in the 2D domain, applying them directly to 3D scene understanding is not straightforward. We therefore aim to leverage existing large foundation models to understand 3D scenes and extract 3D information, rather than building an entirely new 3D model from scratch.
The advancement of large foundation models has inspired recent studies to establish connections among images, text, and 3D data. The work in [2] extracts features from 3D objects with a 3D encoder and aligns them with the features that a visual encoder and a text encoder extract from the corresponding rendered 2D images and text, enabling retrieval of 3D objects across modalities. However, this model mainly focuses on individual 3D objects and is limited in understanding complex 3D spaces such as indoor scenes.
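To make the alignment idea concrete, here is a minimal NumPy sketch of a CLIP-style symmetric contrastive (InfoNCE) loss together with cosine-similarity retrieval. It assumes the 3D, image, and text encoders have already produced fixed-length embeddings; random arrays stand in for them, and the temperature of 0.07 is an assumed value rather than a detail taken from [2].

```python
import numpy as np


def l2_normalize(x):
    # project every embedding onto the unit sphere so dot products become cosines
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def symmetric_infonce(feat_a, feat_b, temperature=0.07):
    """CLIP-style contrastive loss; row i of feat_a and row i of feat_b form a positive pair."""
    logits = l2_normalize(feat_a) @ l2_normalize(feat_b).T / temperature  # (N, N)

    def cross_entropy_diag(l):
        # softmax cross-entropy whose correct class for row i is column i
        l = l - l.max(axis=1, keepdims=True)
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # average the a->b and b->a directions
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))


def retrieve(queries, gallery):
    """Index of the most similar gallery embedding (e.g. a 3D object) for each query (e.g. a caption)."""
    return np.argmax(l2_normalize(queries) @ l2_normalize(gallery).T, axis=1)


rng = np.random.default_rng(0)
shape_emb = rng.normal(size=(8, 64))                     # stand-in for 3D-encoder outputs
text_emb = shape_emb + 0.01 * rng.normal(size=(8, 64))   # stand-in for well-aligned text embeddings
print(symmetric_infonce(shape_emb, text_emb))            # small, because the pairs are aligned
print(retrieve(text_emb, shape_emb))                     # each caption retrieves its own shape
```

Training would push real encoder outputs toward the low-loss, diagonal-dominant regime that the random stand-ins are placed in here.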
The work in [3] reconstructs compact 3D indoor scenes from multi-view images using DVGO and aligns them with semantic segmentation extracted from the corresponding 2D multi-view images by the CLIP-LSeg model. A visual reasoning pipeline is then applied to reasoning tasks in 3D spaces.
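Attaching 2D semantics to 3D can be illustrated generically by projecting 3D points into each calibrated view and averaging the per-pixel features they land on. This is a hypothetical sketch, not the DVGO/CLIP-LSeg pipeline of [3]: it assumes known pinhole intrinsics and world-to-camera poses, uses a nearest-pixel lookup, and ignores occlusion.

```python
import numpy as np


def lift_features(points, feat_maps, Ks, Rs, ts):
    """Average the 2D feature vectors that each 3D point projects onto across views.

    points: (P, 3) world coordinates; feat_maps: list of (H, W, C) per-pixel features;
    Ks: 3x3 intrinsics; Rs, ts: world-to-camera rotations (3x3) and translations (3,).
    """
    P, C = len(points), feat_maps[0].shape[2]
    acc, cnt = np.zeros((P, C)), np.zeros(P)
    for F, K, R, t in zip(feat_maps, Ks, Rs, ts):
        cam = points @ R.T + t                    # world -> camera coordinates
        z = cam[:, 2]
        front = z > 1e-6                          # keep points in front of the camera
        uvw = cam @ K.T                           # pinhole projection, homogeneous pixels
        u = np.full(P, -1)
        v = np.full(P, -1)
        u[front] = np.round(uvw[front, 0] / z[front]).astype(int)
        v[front] = np.round(uvw[front, 1] / z[front]).astype(int)
        H, W = F.shape[:2]
        vis = front & (u >= 0) & (u < W) & (v >= 0) & (v < H)
        acc[vis] += F[v[vis], u[vis]]             # nearest-pixel lookup, no occlusion test
        cnt[vis] += 1
    seen = cnt > 0
    out = np.zeros_like(acc)
    out[seen] = acc[seen] / cnt[seen, None]
    return out, seen


# toy check: one point straight ahead, one behind the camera, one outside the image
K = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0], [0.0, 0.0, 1.0]])
pts = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, -1.0], [10.0, 0.0, 1.0]])
F1 = np.zeros((3, 3, 2))
F1[1, 1] = [5.0, 7.0]
F2 = np.zeros((3, 3, 2))
F2[1, 1] = [1.0, 1.0]
feats, seen = lift_features(pts, [F1, F2], [K, K], [np.eye(3)] * 2, [np.zeros(3)] * 2)
print(feats[0], seen)   # the first point averages [5, 7] and [1, 1]; only it is seen
```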
We are currently developing an ML model that leverages existing large foundation models to better understand 3D scenes with lower computational complexity.

References:
[1] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[2] Hegde, Deepti, Jeya Maria Jose Valanarasu, and Vishal [...]

By |April 7th, 2024|News|Comments Off on MCL Research on 3D Perception with Large Foundational Models|

MCL Research on Green Image Coding

Image coding is a fundamental multimedia technology. Popular image coding
techniques include hybrid and deep learning-based frameworks. Our green image
coding (GIC) aims to explore a new path to a low-complexity and high-efficiency
coding framework.
The main characteristics of the proposed GIC are multi-grid representation and vector
quantization (VQ). Natural images have rich frequency components and high energy.
Multi-grid representation uses resampling techniques to decompose the image into
multiple layers so that the energy and frequency components are distributed to
different layers. Then, we use vector quantization to handle each layer.
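
A minimal sketch of the two ingredients follows: a multi-grid decomposition into a coarse layer plus residual layers, and VQ of each layer with a k-means codebook over 2x2 blocks. The 2x2 averaging, nearest-neighbour upsampling, block size, and codebook size are all assumptions for illustration; GIC's actual resampling filters and codebook design are not described here.

```python
import numpy as np


def down2(x):
    # 2x2 block averaging halves each spatial dimension
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))


def up2(x):
    # nearest-neighbour upsampling doubles each spatial dimension
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)


def multigrid_decompose(img, levels):
    """Return [coarsest layer, residual layers from coarse to fine]; they sum back to img."""
    cur, residuals = img.astype(float), []
    for _ in range(levels):
        coarse = down2(cur)
        residuals.append(cur - up2(coarse))   # detail that the coarser grid cannot hold
        cur = coarse
    return [cur] + residuals[::-1]


def multigrid_reconstruct(layers):
    cur = layers[0]
    for res in layers[1:]:
        cur = up2(cur) + res
    return cur


def kmeans(x, k, iters=20, seed=0):
    # plain Lloyd iterations; returns the codebook and each vector's codeword index
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        idx = ((x[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
        for c in range(k):
            if np.any(idx == c):
                centers[c] = x[idx == c].mean(0)
    idx = ((x[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
    return centers, idx


def vq_layer(layer, k, b=2):
    """Vector-quantize a layer: each b x b block is one vector, replaced by its nearest codeword."""
    h, w = layer.shape
    vecs = layer.reshape(h // b, b, w // b, b).transpose(0, 2, 1, 3).reshape(-1, b * b)
    codebook, idx = kmeans(vecs, min(k, len(vecs)))
    blocks = codebook[idx].reshape(h // b, w // b, b, b)
    return blocks.transpose(0, 2, 1, 3).reshape(h, w)


rng = np.random.default_rng(0)
img = rng.random((16, 16))
layers = multigrid_decompose(img, levels=2)
print([layer.shape for layer in layers])      # [(4, 4), (8, 8), (16, 16)]
coded = [vq_layer(layer, k=8) for layer in layers]
print(np.mean((multigrid_reconstruct(coded) - img) ** 2))
```

Without quantization the layers reconstruct the image exactly; all coding loss in this sketch comes from replacing blocks by codewords.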
Building on the research on our previous GIC framework [1][2], we identified two key
issues in the multi-grid representation + vector quantization framework. First, unlike
the traditional coding framework, vector quantization does not need a transform:
a transform decorrelates the input signal, whereas vector quantization relies on the
correlation among the different dimensions of its input vectors. Second, when
applying multi-grid representation in compression, the key problem is bitrate
allocation, i.e., how to assign bits to the different layers. In traditional frameworks,
bitrate allocation or rate control operates across blocks or frames that stand in a
relatively parallel relationship. In our multi-grid representation, however, bitrate
allocation is a sequential operation, because the bitrate of the current layer
significantly influences the next one. To address this issue, we use a slope matching
technique for bitrate allocation. Specifically, when the rate-distortion (RD) slope of the
current layer decreases to the starting slope of the next layer, we stop coding the
current layer and switch to the next.
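
The slope matching rule can be sketched on toy rate-distortion curves. Each layer is assumed to offer a list of (rate, distortion) operating points with decreasing slopes, and the curves are treated as fixed inputs; in GIC the next layer's curve depends on how the current layer was coded, so this sketch only illustrates the stopping rule. The `final_slope` parameter for the last layer is hypothetical.

```python
def rd_slope(p, q):
    # distortion drop per extra bit when moving from operating point p to q
    return (p[1] - q[1]) / (q[0] - p[0])


def slope_matching(layers, final_slope):
    """Pick an operating point per layer; layers[k] lists (rate, distortion) points by increasing rate."""
    chosen = []
    for k, pts in enumerate(layers):
        # threshold = starting slope of the next layer (or a preset slope for the last layer)
        if k + 1 < len(layers):
            thr = rd_slope(layers[k + 1][0], layers[k + 1][1])
        else:
            thr = final_slope
        j = 0
        while j + 1 < len(pts) and rd_slope(pts[j], pts[j + 1]) > thr:
            j += 1   # keep spending bits here while this layer still buys more distortion per bit
        chosen.append(j)
    return chosen


layers = [
    [(0, 100), (1, 60), (2, 45), (3, 40)],   # coarse layer: slopes 40, 15, 5
    [(0, 40), (1, 30), (2, 25)],             # next layer: slopes 10, 5
]
print(slope_matching(layers, final_slope=6))   # [2, 1]
```

The coarse layer stops at its third operating point because its next slope (5) has fallen below the next layer's starting slope (10).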

[1] Wang Y, Mei Z, Zhou Q, et al. Green image codec: a lightweight learning-based
image coding method[C]//Applications of Digital Image Processing XLV. SPIE, 2022,
12226: 70-75.
[2] Wang Y, Mei Z, Katsavounidis [...]

By |March 31st, 2024|News|Comments Off on MCL Research on Green Image Coding|

MCL Research on Green Point Cloud Surface Reconstruction

Surface reconstruction from point cloud scans plays a pivotal role in 3D vision and graphics, with diverse applications such as AR/VR games, cultural heritage preservation, and building information modeling (BIM). The task is inherently challenging because reconstructing continuous surfaces from discrete points is ill-posed. Moreover, real-world point cloud scans introduce many obstacles, such as varying densities and sensor noise. These properties make surface reconstruction a long-standing problem that keeps driving researchers to look for more effective solutions.
Early research focused on combinatorial methods [1]–[2], which infer the connectivity between points directly. Mainstream surface reconstruction methods adopt an implicit surface approach [3]–[4], in which the surface is represented as a level set of an unknown continuous function obtained by solving the associated partial differential equations (PDEs). Although these methods offer good quality, they often require oriented normals or additional constraints. More recently, deep learning (DL) models [5]–[6] have been developed to solve this problem within a supervised learning framework; they exploit training data to learn an implicit function representation.
Despite their high reconstruction quality, DL models still face challenges in generalizability and complexity. In scenarios such as point cloud compression, quality assessment, and dynamic point cloud processing, there is a growing need for low-complexity, low-latency surface reconstruction methods. However, existing PDE- and DL-based methods tend to sacrifice simplicity for reconstruction quality, leaving a gap for low-complexity solutions.

We adopt an unsupervised framework and propose a lightweight, high-performance method named green point-cloud surface reconstruction (GPSR). GPSR belongs to the family of implicit surface reconstruction methods. Its main idea is to build a signed distance field (SDF) through approximated heat diffusion and then fine-tune it iteratively. [...]
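
The link between heat diffusion and distance can be illustrated with Varadhan's formula, which recovers distance from heat diffused for a short time t: d(x) ≈ sqrt(-4t log u(x, t)). In free space, the heat released at the points is a sum of Gaussians, so the formula reduces to a smooth minimum of squared distances. This sketch only produces an unsigned distance; the sign assignment, grid discretization, and iterative fine-tuning used in GPSR are not reproduced, and t is an assumed parameter.

```python
import numpy as np


def heat_distance(queries, points, t=1e-3):
    """Unsigned distance from each query to a point cloud via short-time heat diffusion.

    Heat released at every point and diffused for time t gives, up to a constant factor,
    u(x) = sum_i exp(-|x - p_i|^2 / (4t)); Varadhan's formula then gives d(x) ~ sqrt(-4t log u(x)).
    """
    d2 = ((queries[:, None, :] - points[None, :, :]) ** 2).sum(-1)   # (Q, N) squared distances
    a = -d2 / (4.0 * t)
    m = a.max(axis=1, keepdims=True)                                 # log-sum-exp for stability
    log_u = m[:, 0] + np.log(np.exp(a - m).sum(axis=1))
    return np.sqrt(np.maximum(-4.0 * t * log_u, 0.0))


theta = np.linspace(0.0, 2.0 * np.pi, 64, endpoint=False)
circle = np.stack([np.cos(theta), np.sin(theta)], axis=1)            # samples of a unit circle
queries = np.array([[0.0, 0.0], [2.0, 0.0]])
print(heat_distance(queries, circle))    # both values are close to the true distance 1
```

The estimate never exceeds the true nearest-point distance and undershoots it by at most roughly 4t log N in squared distance, so a smaller t gives a sharper but less smooth field.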

By |March 24th, 2024|News|Comments Off on MCL Research on Green Point Cloud Surface Reconstruction|

MCL Research on POS Tagging Prediction

Part-of-speech (POS) tagging is one of the basic sequence labeling tasks. It aims to tag every word of a sentence with its part-of-speech attribute. Since POS offers a fundamental syntactic attribute of words, POS tagging is useful for many downstream tasks, such as speech recognition, syntactic parsing, and machine translation, and it is a crucial preliminary step in building interpretable NLP models. POS tagging has been successfully solved with complex sequence-to-sequence models based on deep learning (DL) technology, such as LSTMs and Transformers. With recent advances in large language models (LLMs), LLMs can also perform the POS tagging task as versatile, general-purpose models. However, DL models demand high computational and storage costs, which the POS tagging task itself does not inherently require. Lightweight, high-performance POS taggers are therefore needed to offer efficiency while ensuring efficacy for downstream tasks. 

To meet this demand, we propose a novel word-embedding-based POS tagger named GWPT. Following the green learning (GL) methodology (Kuo & Madni, 2022), GWPT contains three cascaded modules: 1) representation learning, 2) feature learning, and 3) decision learning. The last two modules of GWPT adopt standard procedures, i.e., the discriminant feature test (DFT) (Yang et al., 2022) for feature selection and the XGBoost classifier for POS prediction. The main novelty of this work lies in the representation learning module, which derives the representation of a word from its embedding. Both non-contextual and contextual embeddings can be used. GWPT partitions the embedding dimension indices into three sets: low-, medium-, and high-frequency. It discards the dimension indices in the low-frequency set and considers the N-gram representation for dimension indices in the medium- and high-frequency [...]
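
As an illustration of an N-gram representation over selected embedding dimensions, the sketch below concatenates those dimensions for a word and its neighbours within a centred window. The frequency-based partition itself is not specified above, so the dimension sets are passed in as inputs, and the window sizes (`n_mid`, `n_high`) are hypothetical parameters. Each output row would be one word's representation, to be passed on to DFT and XGBoost.

```python
import numpy as np


def ngram_features(sent_emb, dims, n):
    """Concatenate the chosen dimensions of each word and its neighbours in a centred window of n words."""
    L, half = sent_emb.shape[0], n // 2
    sub = sent_emb[:, dims]                                  # (L, len(dims))
    pad = np.zeros((half, sub.shape[1]))                     # zero vectors beyond the sentence ends
    padded = np.vstack([pad, sub, pad])
    return np.hstack([padded[k:k + L] for k in range(n)])    # (L, n * len(dims))


def gwpt_like_representation(sent_emb, mid_dims, high_dims, n_mid, n_high):
    # low-frequency dimensions are simply not used (discarded)
    return np.hstack([ngram_features(sent_emb, mid_dims, n_mid),
                      ngram_features(sent_emb, high_dims, n_high)])


rng = np.random.default_rng(0)
sent = rng.normal(size=(6, 10))       # 6 words with 10-dimensional embeddings
rep = gwpt_like_representation(sent, mid_dims=[0, 3, 5], high_dims=[7, 9], n_mid=3, n_high=5)
print(rep.shape)                      # (6, 19)
```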

By |March 17th, 2024|News|Comments Off on MCL Research on POS Tagging Prediction|

MCL Research on Saliency Detection Method

Saliency detection research predominantly falls into two categories: human eye fixation prediction, which predicts the image locations where human gaze is most concentrated [1], and salient object detection (SOD) [2], which aims to identify salient object regions within an image. Our study focuses on the former, i.e., predicting human gaze from visual stimuli.

Saliency detection is a crucial task for predicting human gaze patterns from visual stimuli. Demand for saliency detection research is growing, driven by the need to incorporate such techniques into various computer vision tasks and to understand the human visual system. Many existing saliency detection methods rely on deep neural networks (DNNs) to achieve good performance. However, the large model sizes of these approaches impede their integration with other modules and their deployment on mobile devices.

To address this issue, our study introduces a novel saliency detection method named “GreenSaliency”, which does not use DNNs and emphasizes a small model footprint and low computational complexity. GreenSaliency comprises two primary steps: 1) multi-layer hybrid feature extraction and 2) multi-path saliency prediction. Experimental results show that GreenSaliency achieves performance comparable to some deep-learning-based (DL-based) methods while requiring a considerably smaller model size and significantly lower computational complexity.

[1] Zhaohui Che, Ali Borji, Guangtao Zhai, Xiongkuo Min, Guodong Guo, and Patrick Le Callet, “How is gaze influenced by image transformations? Dataset and model,” IEEE Transactions on Image Processing, 29, 2287–2300.

[2] Dingwen Zhang, Junwei Han, Yu Zhang, and Dong Xu, “Synthesizing supervision for learning deep saliency network without human annotation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(7), 1755–1769.

By |March 10th, 2024|News|Comments Off on MCL Research on Saliency Detection Method|

Welcome New MCL Member Xuechun Hua

We are so happy to welcome a new MCL member, Xuechun Hua, who joined MCL this semester. Here is a quick interview with Xuechun:

1. Could you briefly introduce yourself and your research interests?

My name is Xuechun Hua. I am a second-year graduate student at Viterbi pursuing a master’s degree in Computer Science. I completed my undergraduate studies at Nanjing University (NJU), where I focused on game theory. I have since developed research interests in interpretable machine learning and computer vision.

2. What is your impression of MCL and USC?

USC has offered me an exceptional environment that fosters both my personal and academic development. The atmosphere at MCL is one of serious academic pursuit combined with a strong sense of collaboration. I am deeply thankful for the extensive support I’ve received from senior members of MCL. Professor Kuo stands out as an endlessly energetic researcher, who greatly inspires us with his guidance and enthusiasm for exploration.

3. What is your future expectation and plan in MCL?

I will be collaborating with Qingyang on a project on mesh reconstruction from point clouds, under the mentorship of Professor Kuo. Moving forward, I aim to strengthen my research skills at MCL and contribute to advances in 3D representation and green learning.

By |March 3rd, 2024|News|Comments Off on Welcome New MCL Member Xuechun Hua|

MCL Research on Transfer Learning

Transfer learning extracts features from a source domain and transfers source-domain knowledge to a target domain, so as to improve the performance of target learners while relieving the pressure to collect large amounts of target-domain data [1]. It has been widely applied in areas such as image classification and text classification.

Domain adaptation on digit datasets, namely MNIST, USPS, and SVHN, is the task of transferring learning models across different datasets. It aims to align features across domains and build a generalized model that can predict labels on all of these digit datasets.
We have trained a green-learning-based transfer learning model between MNIST and USPS. The first step is preprocessing and feature extraction: the datasets are processed so that they look visually similar, raw Saab features are extracted [2], and the LNT feature transformation [3] is applied, followed by a cosine similarity check to find discriminant features. The second step is joint subspace generation: for each label in the source domain, k-means is performed separately with 1, 2, 4, and 8 clusters, generating 10, 20, 40, and 80 subspaces, respectively, and the target features are then assigned to the generated subspaces. The third step uses the assigned data to conduct weakly supervised learning that predicts labels for the remaining target samples. Our goal is to compare and analyze the performance of our green-learning-based transfer learning models against other models. In the future, we aim to conduct transfer learning mutually among all three digit datasets and to improve accuracy through better cross-domain feature alignment.
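
The joint subspace generation step can be sketched as per-class k-means on source features followed by nearest-centroid assignment of target features, where the class that owns the centroid serves as a pseudo-label for the later weakly supervised step. With 10 digit classes and k = 1, 2, 4, 8, this yields the 10, 20, 40, and 80 subspaces mentioned above. Euclidean distance, the toy 2-D features, and the helper names are assumptions; Saab/LNT features would replace the random data in practice.

```python
import numpy as np


def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns the k centroids."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        idx = ((x[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
        for c in range(k):
            if np.any(idx == c):           # leave empty clusters where they are
                centers[c] = x[idx == c].mean(0)
    return centers


def build_subspaces(src_x, src_y, k):
    """Run k-means separately inside each source class: n_classes * k subspaces in total."""
    centers, owner = [], []
    for label in np.unique(src_y):
        centers.append(kmeans(src_x[src_y == label], k))
        owner += [label] * k
    return np.vstack(centers), np.array(owner)


def assign_targets(tgt_x, centers, owner):
    """Assign each target feature to its nearest subspace centroid; the owner class is a pseudo-label."""
    idx = ((tgt_x[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
    return idx, owner[idx]


rng = np.random.default_rng(0)
src_x = np.vstack([rng.normal(0.0, 0.3, (50, 2)), rng.normal(5.0, 0.3, (50, 2))])
src_y = np.repeat([0, 1], 50)
tgt_x = np.vstack([rng.normal(0.5, 0.3, (20, 2)), rng.normal(5.5, 0.3, (20, 2))])  # shifted domain
centers, owner = build_subspaces(src_x, src_y, k=2)     # 2 classes x 2 clusters = 4 subspaces
_, pseudo = assign_targets(tgt_x, centers, owner)
print(pseudo)
```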

[1] Zhuang, Fuzhen, et al. “A comprehensive survey on transfer learning.” Proceedings of the IEEE 109.1 (2020): 43-76.
[2] Y. Chen, M. Rouhsedaghat, [...]

By |February 25th, 2024|News|Comments Off on MCL Research on Transfer Learning|

MCL Research on Image Demosaicing

In the world of digital images, turning raw sensor data into colorful pictures that we can see is a complex process. Demosaicing, a crucial step in this process, converts sensor data into full-color images. The sensor data comes from a color filter array such as the Bayer filter, named after Bryce Bayer: a grid of tiny color filters, usually red, green, or blue, laid over the sensor so that each pixel records only one color.
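
To show what the raw data looks like, the sketch below simulates an RGGB Bayer mosaic and reconstructs full color with classic bilinear interpolation. This is the standard textbook baseline, not our LQBoost-based method; the RGGB layout and the mirror padding at the borders are assumptions.

```python
import numpy as np


def bayer_masks(h, w):
    """Boolean masks for an RGGB layout: R at (even, even), B at (odd, odd), G elsewhere."""
    rows, cols = np.indices((h, w))
    r = (rows % 2 == 0) & (cols % 2 == 0)
    b = (rows % 2 == 1) & (cols % 2 == 1)
    return r, ~(r | b), b


def mosaic(rgb):
    """Simulate the raw sensor: each pixel keeps only the colour its filter passes."""
    masks = bayer_masks(*rgb.shape[:2])
    return sum(rgb[..., c] * masks[c] for c in range(3))


def filter3x3(x, k):
    """3x3 weighted sum of neighbours; mirror padding keeps the Bayer parity at the borders."""
    p = np.pad(x, 1, mode="reflect")
    h, w = x.shape
    return sum(k[i, j] * p[i:i + h, j:j + w] for i in range(3) for j in range(3))


def bilinear_demosaic(raw):
    """Bilinear baseline: each missing colour is the average of its nearest same-colour samples."""
    r, g, b = bayer_masks(*raw.shape)
    k_g = np.array([[0, 1, 0], [1, 4, 1], [0, 1, 0]]) / 4.0
    k_rb = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]]) / 4.0
    return np.stack([filter3x3(raw * r, k_rb),
                     filter3x3(raw * g, k_g),
                     filter3x3(raw * b, k_rb)], axis=-1)


flat = np.ones((6, 8, 3)) * np.array([0.2, 0.5, 0.9])   # a uniformly coloured test image
raw = mosaic(flat)                                       # (6, 8): one sample per pixel
print(np.allclose(bilinear_demosaic(raw), flat))         # True: a flat colour is recovered exactly
```

On real images, bilinear interpolation blurs edges and creates colour fringes, which is what learned demosaicing methods try to improve on.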

But this conversion isn’t easy. Real-world challenges like sensor noise and motion blur can degrade the final image, and collecting accurate data to train models for this task is time-consuming.

Because image processing can be slow, especially on devices like phones, we’re looking into simpler methods that still give good results. Recently, we’ve been experimenting with a new regressor named “LQBoost”.

LQBoost conducts least-squares regressions in successive iterations, gradually narrowing the gap between its predictions and the actual targets. By minimizing the residuals (the differences between predicted and actual values), LQBoost improves its accuracy iteratively. It also employs a thresholding process to prune samples whose residuals approach zero, streamlining the regression process.

Taking LQBoost a step further, we integrate the Least-squares Normal Transform (LNT) to enrich the feature set and capture intricate data structures more effectively. This integration allows a more nuanced understanding of the data, leading to improved predictions.

Before applying LQBoost to our demosaicing task, we perform a crucial preprocessing step. By clustering the dataset and utilizing cluster purity, we initialize the regression process effectively. This step ensures that each cluster receives an accurate initial prediction, setting the stage for LQBoost to refine these predictions through iterative regression.

Our goal is to create a demosaicing model that’s both accurate and fast. We’ve tested it thoroughly using standard image datasets, making [...]

By |February 18th, 2024|News|Comments Off on MCL Research on Image Demosaicing|

MCL Research on LQBoost Regressor

LQBoost operates on the principle of leveraging successive linear regressions to approximate the target. Each least-squares regression maps the feature space into the current target space and approximates the samples within it, and the residual of the current target becomes the regression target for the next iteration. In each iteration, samples whose residuals are close to zero are pruned through a thresholding process, giving a clearer and more accurate approximation of the remaining target space. Beyond this iterative optimization, we can further enrich the feature set with LNT to better capture the complex structures within the data. This iterative regression reduces the gap between the cumulative approximation and the target, resulting in increasingly accurate approximations.
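
A minimal sketch of the successive-regression idea follows. It assumes a fixed feature matrix (the LNT feature enrichment is omitted), re-evaluates which samples to prune at every iteration with a hypothetical tolerance `prune_tol`, and applies every stage to every sample at prediction time, since how pruned samples are routed at inference is not described here.

```python
import numpy as np


def lqboost_fit(X, y, n_iters=5, prune_tol=1e-3):
    """Successive least-squares regressions on residuals with near-zero-residual pruning."""
    A = np.hstack([X, np.ones((len(X), 1))])       # features plus a bias column
    residual = np.asarray(y, dtype=float).copy()
    stages = []
    for _ in range(n_iters):
        active = np.abs(residual) > prune_tol       # prune samples that are already well fitted
        if not active.any():
            break
        w, *_ = np.linalg.lstsq(A[active], residual[active], rcond=None)
        stages.append(w)
        residual -= A @ w                           # the new residual is the next stage's target
    return stages


def lqboost_predict(stages, X):
    """Cumulative prediction: the sum of all stage outputs."""
    A = np.hstack([X, np.ones((len(X), 1))])
    return sum((A @ w for w in stages), np.zeros(len(X)))


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 3.0     # purely linear target
stages = lqboost_fit(X, y)
print(len(stages))                           # one stage already fits it, so every sample is pruned
stages_nl = lqboost_fit(X, np.sin(2.0 * X[:, 0]))
print(len(stages_nl))                        # a nonlinear target keeps later stages busy
```

With fixed linear features, later stages can only refit the unpruned subset, which is why the feature enrichment described above matters for nonlinear targets.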

Before the regression, we preprocess the data by clustering the dataset into several clusters and using the purity of each cluster as the initial value for regression. Samples in high-purity clusters take the cluster’s majority label as their predicted value. For the other clusters, we use the purity as the initial predicted value, and the residuals generated by these clusters serve as the target space for the first least-squares regression layer of LQBoost.
This preprocessing step is essential for initializing the regression process effectively. By clustering the dataset and utilizing cluster purity, we can assign more accurate initial predictions for each cluster. High-purity clusters, where most samples belong to a single class, provide a straightforward prediction based on the majority label. On [...]
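
The cluster-based initialization can be sketched as below. The sketch assumes binary 0/1 labels and precomputed cluster assignments (from k-means, for example). For clusters that are not pure enough, it uses the cluster's fraction of label-1 samples as the soft initial value, which is one reading of "using the purity"; the purity threshold of 0.9 is hypothetical.

```python
import numpy as np


def purity_init(cluster_ids, y, high_purity=0.9):
    """Cluster-based initial predictions for binary 0/1 labels; returns (init, first residual)."""
    init = np.zeros(len(y))
    for c in np.unique(cluster_ids):
        members = cluster_ids == c
        p1 = y[members].mean()               # fraction of label-1 samples in the cluster
        purity = max(p1, 1.0 - p1)           # share of the majority label
        if purity >= high_purity:
            init[members] = float(p1 >= 0.5) # pure cluster: predict its majority label
        else:
            init[members] = p1               # impure cluster: soft initial value
    return init, y - init                    # residual = target of the first LQBoost stage


clusters = np.array([0, 0, 0, 0, 1, 1, 1, 1])
labels = np.array([1, 1, 1, 1, 1, 0, 0, 1])
init, residual = purity_init(clusters, labels)
print(init)       # the pure cluster gets 1.0, the mixed cluster gets 0.5
print(residual)
```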

By |February 11th, 2024|News|Comments Off on MCL Research on LQBoost Regressor|

MCL Research on SLMBoost Classifier

A machine learning framework has three fundamental components: feature extraction, feature selection, and a decision model. In Green Learning, we have the Saab transform and the Least-squares Normal Transform (LNT) for feature extraction, and the Discriminant Feature Test (DFT) and the Relevant Feature Test (RFT) for feature selection. However, we do not yet have a green and interpretable solution for the decision model. For a long time, we have applied gradient-boosted trees such as XGBoost or LightGBM as the classifier. Yet XGBoost and LightGBM models are known to sacrifice interpretability for performance, and their large model size has become a heavy burden for Green Learning. We are therefore motivated to develop a green and interpretable classifier called SLMBoost, which trains a boosting model with the Subspace Learning Machine (SLM). 

Let’s start with a single SLM. Each SLM first identifies a discriminant subspace through a series of feature selection techniques, including DFT, RFT, and the removal of correlated features. It then learns a linear regression model on the selected subspace. Figure 1 illustrates a single SLM. In short, a single SLM is a linear least-squares model that operates on a subspace.

Further, we ensemble SLMs in a boosting fashion by training a sequence of SLMs, where each SLM is trained to correct the errors made by the previous ones. To achieve this, the training target of an SLM is the residual of the accumulated result of all the SLMs before it. Two key factors make this boosting work: varying the features and varying the data subsets. By using DFT/RFT, we can zoom in on different [...]
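
The boosting loop can be sketched as follows. Each round draws a random data subset, scores every feature with a single-threshold split test in the spirit of RFT, fits a least-squares model on the best-scoring subspace (one SLM), and adds a shrunken copy of its output to the running prediction. The split test, subset ratio, learning rate, and subspace size are assumptions standing in for the full DFT/RFT and correlation-removal pipeline; the sketch regresses a real-valued target, and turning it into a classifier (e.g., by regressing toward label encodings) is left out.

```python
import numpy as np


def split_loss(x, r, n_bins=16):
    """RFT-style score: lowest weighted variance of r after one threshold split of feature x."""
    best = np.inf
    for t in np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1]):
        left, right = r[x <= t], r[x > t]
        if len(left) and len(right):
            best = min(best, (len(left) * left.var() + len(right) * right.var()) / len(r))
    return best


def with_bias(X):
    return np.hstack([X, np.ones((len(X), 1))])


def fit_slmboost(X, y, n_rounds=20, n_feats=3, subsample=0.8, lr=0.5, seed=0):
    rng = np.random.default_rng(seed)
    base = float(np.mean(y))
    pred = np.full(len(y), base)
    slms = []
    for _ in range(n_rounds):
        rows = rng.choice(len(y), size=int(subsample * len(y)), replace=False)  # data subset
        r = y[rows] - pred[rows]                                                  # residual target
        losses = [split_loss(X[rows, j], r) for j in range(X.shape[1])]
        feats = np.argsort(losses)[:n_feats]                                      # chosen subspace
        w, *_ = np.linalg.lstsq(with_bias(X[rows][:, feats]), r, rcond=None)      # one SLM
        slms.append((feats, w))
        pred += lr * (with_bias(X[:, feats]) @ w)
    return base, lr, slms


def predict_slmboost(model, X):
    base, lr, slms = model
    return base + lr * sum(with_bias(X[:, f]) @ w for f, w in slms)


rng = np.random.default_rng(1)
X = rng.normal(size=(400, 8))
y = 2.0 * X[:, 0] - 3.0 * X[:, 4] + 0.1 * rng.normal(size=400)
model = fit_slmboost(X, y)
print(np.mean((predict_slmboost(model, X) - y) ** 2))
```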

By |February 4th, 2024|News|Comments Off on MCL Research on SLMBoost Classifier|