Congratulations to Yeji Shen on passing his defense on Sep 7, 2021. His Ph.D. thesis is titled “Labeling Cost Reduction Techniques for Deep Learning: Methodologies and Applications”. Here we invite Yeji to share a brief introduction to his thesis and a few words at the end of his Ph.D. journey.

1) Abstract of Thesis

Deep learning has contributed to a significant performance boost in many computer vision tasks. Still, the success of most existing deep learning techniques relies on a large amount of labeled data. Since data labeling is costly, a natural question arises: is it possible to achieve better performance with the same labeling budget? We provide two directions to address the problem: using the budget more efficiently, or supplementing the labeled data with unlabeled data that incurs no labeling cost. Specifically, in this dissertation, we study three problems related to reducing the labeling cost: 1) active learning, which aims at identifying the most informative unlabeled samples for labeling; 2) weakly supervised 3D human pose estimation, which utilizes a special type of unlabeled data, videos of action-frozen people, to improve performance with few manual annotations; and 3) self-supervised representation learning on a large-scale dataset of images with text and user-input tags at no additional labeling cost.

In the first part of this talk, we will introduce our representation learning work, which mainly focuses on the utilization of textual information in images. Text inside images can provide valuable cues for image understanding. We propose a simple but effective representation learning framework called Self-Supervised Representation learning of Images with Texts (SSRIT). SSRIT exploits optical character recognition (OCR) signals in a self-supervised manner. It constructs a representation that is trained to predict whether the text in the image contains particular words or phrases. This allows us to leverage unlabeled data to uncover the non-textual visual features shared by images that contain similar text. The SSRIT representation is beneficial to image tag prediction and functional image classification. In both tasks, SSRIT outperforms baseline models with no OCR information as well as models that consume OCR with no self-supervised representation.
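The self-supervised target described above can be pictured with a small sketch. This is only an illustration under stated assumptions, not the SSRIT implementation: the vocabulary `VOCAB` and the helper `ocr_targets` are hypothetical names, and the OCR output is assumed to arrive as a plain string. Each image gets a multi-hot vector marking which vocabulary entries appear in its OCR text, and an image encoder could then be trained to predict that vector from pixels alone (for example with a per-entry binary cross-entropy loss, which is also an assumption here).

```python
import re

# Hypothetical vocabulary of words and phrases; chosen only for illustration.
VOCAB = ["sale", "free shipping", "recipe", "breaking news"]


def ocr_targets(ocr_text, vocab=VOCAB):
    """Return a multi-hot list: 1 where the vocab entry occurs in the OCR text."""
    # Lowercase and keep only alphanumeric tokens, joined by single spaces.
    # Padding with spaces lets us match whole words and phrases only.
    tokens = re.findall(r"[a-z0-9]+", ocr_text.lower())
    normalized = " " + " ".join(tokens) + " "
    return [1 if f" {entry} " in normalized else 0 for entry in vocab]


# Example: the target for an image whose OCR output mentions a sale.
example_target = ocr_targets("SALE! Free  shipping today")
```

Because the targets come from an OCR system rather than from annotators, they can be produced for any unlabeled image that contains text.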

In the second part, we will briefly review our previous work on 3D pose estimation and active learning. Our 3D pose work features the proposed MVM method, which can efficiently extract and recover 3D poses from a special kind of video of action-frozen people. We further build a dataset using this method. The active learning part contains two methods: K-covers and TBAL. Both focus on techniques to balance uncertainty and diversity when selecting samples. With the help of the proposed active learning strategies, we achieve state-of-the-art performance on the MNIST, SVHN, CIFAR-10 and CUB datasets.
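To make the idea of balancing uncertainty and diversity concrete, here is a generic two-stage sketch. It is not K-covers or TBAL, whose details are not given in this post. It assumes the model's class probabilities and a feature vector are available for every unlabeled sample, and the names `select_batch` and `pool_size` are hypothetical. Stage one keeps the most uncertain samples by prediction entropy; stage two picks, among those, samples that are far apart in feature space.

```python
import math


def entropy(probs):
    """Shannon entropy of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_batch(probs, feats, k, pool_size):
    """Pick up to k sample indices that are uncertain and mutually distant."""
    if k <= 0 or not probs:
        return []
    # Stage 1 (uncertainty): keep the pool_size highest-entropy samples.
    order = sorted(range(len(probs)), key=lambda i: entropy(probs[i]), reverse=True)
    candidates = order[:pool_size]
    # Stage 2 (diversity): greedy farthest-point selection within the pool,
    # starting from the single most uncertain sample.
    chosen = [candidates[0]]
    while len(chosen) < min(k, len(candidates)):
        best = max(
            (i for i in candidates if i not in chosen),
            key=lambda i: min(math.dist(feats[i], feats[j]) for j in chosen),
        )
        chosen.append(best)
    return chosen


# Toy pool: samples 0 and 1 are equally uncertain but nearly identical.
toy_probs = [[0.5, 0.5], [0.5, 0.5], [0.9, 0.1], [0.6, 0.4]]
toy_feats = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [3.0, 3.0]]
picked = select_batch(toy_probs, toy_feats, k=2, pool_size=3)
```

In the toy pool, a purely uncertainty-based rule would label the two near-duplicate samples, while the diversity stage swaps one of them for a sample elsewhere in feature space.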

2) Ph.D. Experience

I would like to express my sincere gratitude to Prof. Kuo. Prof. Kuo is a role model not only in conducting research but also in his attitude toward life. His diligence, self-discipline, sense of responsibility and endless enthusiasm for research are qualities that I may never fully achieve in my own life. Still, I hope that I can come closer to Prof. Kuo's example in the future. I would also like to thank my MCL labmates. Life as a Ph.D. student is never easy, and it became a little better with the help and support of friends in our lab. I wish all my fellow MCL members a bright future.