Congratulations to Jintang Xue for passing his Qualifying Exam
We congratulate Jintang Xue on passing his Qualifying Exam! His thesis proposal is titled “Towards Efficient and Interpretable Language Representations for Multimodal Reasoning Systems.” His Qualifying Exam Committee members are Jay Kuo (Chair), Antonio Ortega, Ashutosh Nayyar, Vasileios Magoulianitis, and Robin Jia (Outside Member). Here is a brief summary of his thesis proposal:
Modern multimodal reasoning systems increasingly rely on large-scale neural models to jointly process perception and language. While such models achieve strong performance, they often suffer from high computational cost and limited interpretability. We explore a representation-centric approach that explicitly designs language representations to improve both reasoning capability and efficiency.
The first part of the work focuses on multimodal 3D scene understanding. We introduce an object-centric framework that augments each object with a natural language description capturing both its intrinsic attributes and its relationships to other objects. These descriptions are integrated into large language model–based pipelines through a dual-level strategy that combines embedding-level fusion with prompt-level injection. By explicitly encoding relational semantics in language, the proposed approach significantly enhances grounding, captioning, and question-answering performance in complex 3D scenes, particularly for relational reasoning tasks.
The second part of the work investigates efficient and interpretable language representations. We propose a weakly supervised feature selection framework for word embedding dimensionality reduction, which preserves semantic similarity while substantially reducing computational and storage costs. Unlike black-box compression methods, the proposed approach directly identifies the most informative embedding dimensions, improving both efficiency and interpretability.
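To make the idea of selecting embedding dimensions concrete, here is a minimal illustrative sketch. It is not the method from the proposal: the scoring rule, the function names (`select_dimensions`, `reduce_embeddings`), and the toy vectors are all assumptions made for illustration. It assumes the weak supervision comes as pairs of words labeled similar or dissimilar, and it keeps the dimensions along which dissimilar words differ most and similar words differ least.

```python
# Illustrative sketch only: a simple pairwise scoring rule, assumed here,
# standing in for a weakly supervised dimension-selection criterion.

def select_dimensions(embeddings, similar_pairs, dissimilar_pairs, k):
    """Return the indices of the k highest-scoring embedding dimensions.

    The score of dimension j is the mean squared difference over the
    dissimilar pairs minus the mean squared difference over the similar
    pairs, so a high score means the dimension separates unrelated words
    while keeping related words close.
    """
    dim = len(next(iter(embeddings.values())))

    def mean_sq_diff(pairs, j):
        total = sum((embeddings[a][j] - embeddings[b][j]) ** 2 for a, b in pairs)
        return total / len(pairs)

    scores = [
        mean_sq_diff(dissimilar_pairs, j) - mean_sq_diff(similar_pairs, j)
        for j in range(dim)
    ]
    ranked = sorted(range(dim), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def reduce_embeddings(embeddings, kept):
    """Keep only the selected dimensions of every word vector."""
    return {word: [vec[j] for j in kept] for word, vec in embeddings.items()}


# Toy 4-dimensional vectors (made up): dimensions 0 and 2 carry the
# animal/vehicle distinction, dimensions 1 and 3 are noise.
embeddings = {
    "cat": [1.0, 0.3, 0.9, -0.5],
    "dog": [0.9, -0.4, 1.0, 0.6],
    "car": [-1.0, 0.2, -0.8, -0.4],
    "bus": [-0.9, -0.3, -1.0, 0.5],
}
similar_pairs = [("cat", "dog"), ("car", "bus")]
dissimilar_pairs = [("cat", "car"), ("dog", "bus")]

kept = select_dimensions(embeddings, similar_pairs, dissimilar_pairs, k=2)
reduced = reduce_embeddings(embeddings, kept)
print(kept)            # indices of the retained dimensions
print(reduced["cat"])  # the shortened vector for "cat"
```

Because the output is a list of original dimension indices and not a learned projection, the reduced vectors stay directly traceable to the source embedding, which is the sense in which selection is easier to interpret than black-box compression.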
Together, this work demonstrates that explicitly structured language representations can serve as a powerful and practical alternative to purely scale-driven modeling, enabling multimodal reasoning systems that are more efficient, interpretable, and controllable.









