Understanding and retrieving information in 3D scenes is a significant challenge in artificial intelligence (AI) and machine learning (ML), particularly when it comes to grasping complex spatial relationships and detailed object properties in 3D space. Several tasks have been proposed to assess 3D understanding, including 3D object retrieval, 3D captioning, 3D question answering, and 3D visual grounding.

Existing methods can be roughly divided into two categories. The first category uses large 2D foundation models for feature extraction and maps 2D pixel-wise features to 3D point-wise features for 3D tasks. For example, 3D-CLR [1] extracts 2D features from multiview images with the CLIP-LSeg model [2] and maps them onto a compact 3D representation reconstructed with a neural radiance field; reasoning is then performed by a set of neural reasoning operators. 3D-LLM [3] uses 2D vision-language models (VLMs) as the backbone: it extracts 2D features with ConceptFusion [4], maps them to 3D points, and injects the resulting 3D information into a large language model to generate text outputs.
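To make the pixel-to-point mapping concrete, below is a minimal sketch of lifting multiview 2D feature maps onto 3D points by camera projection. It assumes known camera intrinsics and extrinsics; the function and variable names are illustrative rather than taken from 3D-CLR or 3D-LLM, and a full pipeline would also handle occlusion (e.g., with rendered depth).

```python
import numpy as np

def lift_2d_features_to_3d(points_world, feat_maps, intrinsics, extrinsics):
    """Aggregate per-pixel 2D features onto 3D points by multiview projection.

    points_world: (N, 3) 3D points in world coordinates.
    feat_maps:    (V, H, W, C) per-view 2D feature maps (e.g., from CLIP-LSeg).
    intrinsics:   (V, 3, 3) camera intrinsic matrices.
    extrinsics:   (V, 4, 4) world-to-camera transforms.
    Returns (N, C) point-wise features averaged over the views that see each point.
    """
    V, H, W, C = feat_maps.shape
    N = points_world.shape[0]
    feat_sum = np.zeros((N, C), dtype=np.float32)
    hit_count = np.zeros((N, 1), dtype=np.float32)

    # Homogeneous world coordinates, shape (N, 4).
    pts_h = np.concatenate([points_world, np.ones((N, 1))], axis=1)

    for view in range(V):
        # World -> camera -> pixel coordinates for this view.
        cam = (extrinsics[view] @ pts_h.T).T            # (N, 4) camera-frame coords
        z = cam[:, 2]
        pix = (intrinsics[view] @ cam[:, :3].T).T       # (N, 3) homogeneous pixel coords
        u = pix[:, 0] / np.clip(z, 1e-6, None)          # column index
        v = pix[:, 1] / np.clip(z, 1e-6, None)          # row index

        # Keep points that project inside the image and lie in front of the camera.
        # (No occlusion test here; a real pipeline would compare against rendered depth.)
        valid = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
        cols, rows = u[valid].astype(int), v[valid].astype(int)

        feat_sum[valid] += feat_maps[view, rows, cols]
        hit_count[valid] += 1.0

    # Average over observing views; unobserved points keep zero features.
    return feat_sum / np.clip(hit_count, 1.0, None)
```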

The second category handles 3D point clouds directly with a 3D encoder and aligns the extracted 3D features with features from other modalities. These methods may require training a 3D encoder, which can demand substantial computational resources. For example, Uni3D [5] uses a vanilla transformer, structurally equivalent to a 2D Vision Transformer (ViT), as the backbone to extract 3D features; downstream tasks are then addressed after aligning features across modalities. It is also possible to build on pre-trained 3D encoders: Point-SAM [6], which aims to segment anything in 3D worlds, uses the point cloud encoder from Uni3D to transform the input point cloud into embeddings. It first samples a fixed number of centers with Farthest Point Sampling (FPS) and then groups the k-nearest neighbors of each center into patches.
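As a rough illustration of this tokenization step, the following sketch implements FPS followed by k-nearest-neighbor grouping in NumPy. The patch count and neighborhood size are arbitrary defaults, and none of the helper names come from the Uni3D or Point-SAM code bases; in a real encoder each patch would then be embedded (e.g., by a small point-wise network) and fed to the transformer as a token.

```python
import numpy as np

def farthest_point_sampling(points, num_centers):
    """Pick `num_centers` indices that are maximally spread over the point cloud."""
    N = points.shape[0]
    centers = np.zeros(num_centers, dtype=np.int64)
    dist = np.full(N, np.inf)
    centers[0] = np.random.randint(N)               # random seed point
    for i in range(1, num_centers):
        # Squared distance from every point to the most recently chosen center,
        # folded into the running distance-to-nearest-center.
        diff = points - points[centers[i - 1]]
        dist = np.minimum(dist, np.einsum("ij,ij->i", diff, diff))
        centers[i] = int(np.argmax(dist))           # farthest remaining point
    return centers

def group_into_patches(points, num_centers=512, k=32):
    """Tokenize a point cloud: FPS centers plus k-nearest-neighbor patches around each."""
    center_idx = farthest_point_sampling(points, num_centers)
    centers = points[center_idx]                    # (num_centers, 3)

    # Pairwise squared distances between centers and all points, then take k nearest.
    d2 = ((centers[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    knn_idx = np.argsort(d2, axis=1)[:, :k]         # (num_centers, k)

    # Express each patch relative to its center before feeding it to the encoder.
    patches = points[knn_idx] - centers[:, None, :]
    return centers, patches

# Example: 8192 points -> 512 patches of 32 local points each.
pts = np.random.rand(8192, 3).astype(np.float32)
centers, patches = group_into_patches(pts)
print(centers.shape, patches.shape)                 # (512, 3) (512, 32, 3)
```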

How to achieve comprehensive 3D understanding remains an open question, leaving significant room for improvement in this field. We may need to explore novel algorithms, enhance current techniques, or even develop entirely new frameworks to solve these problems.

References:

[1] Hong, Yining, et al. “3D Concept Learning and Reasoning from Multi-View Images.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.

[2] Li, Boyi, et al. “Language-Driven Semantic Segmentation.” arXiv preprint arXiv:2201.03546, 2022.

[3] Hong, Yining, et al. “3D-LLM: Injecting the 3D World into Large Language Models.” Advances in Neural Information Processing Systems 36 (2023): 20482-20494.

[4] Jatavallabhula, Krishna Murthy, et al. “ConceptFusion: Open-Set Multimodal 3D Mapping.” arXiv preprint arXiv:2302.07241, 2023.

[5] Zhou, Junsheng, et al. “Uni3D: Exploring Unified 3D Representation at Scale.” arXiv preprint arXiv:2310.06773, 2023.

[6] Zhou, Yuchen, et al. “Point-SAM: Promptable 3D Segmentation Model for Point Clouds.” arXiv preprint arXiv:2406.17741, 2024.

Image credits:

The image showing the architecture of 3D-LLM is from [3].

The image showing the architecture of Uni3D is from [5].