Accurate perception of the surrounding environment in tasks such as autonomous driving, robot navigation, and sensor-driven situational awareness requires abundant environmental information. This information can be obtained from multimodal sensors such as LiDAR sensors, electro-optical/infrared (EO/IR) cameras, and GPS/IMU units. Before the collected data can be used, the information from these sensors must be fused. Specifically, one wants to combine the color and shape information from the camera with the distance information from the LiDAR sensor. For this, finding corresponding points between the two sensors is essential. This procedure is called multimodal sensor calibration, and it requires estimating the 6DoF extrinsic parameters between the two sensors.
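To make the role of the extrinsic parameters concrete, the sketch below projects a single LiDAR point into pixel coordinates. It assumes an ideal pinhole camera without lens distortion. The function name `project_point` and the intrinsic parameters `fx`, `fy`, `cx`, `cy` are illustrative choices, not taken from the work described here. It uses only the Python standard library to stay dependency-free.

```python
def project_point(point_lidar, rotation, translation, fx, fy, cx, cy):
    """Project one 3D LiDAR point into pixel coordinates (illustrative sketch).

    rotation is a 3x3 nested list and translation a 3-element list; together
    they are the extrinsic parameters mapping LiDAR to camera coordinates.
    Returns (u, v, depth), or None if the point is not in front of the camera.
    """
    x, y, z = point_lidar
    # Rigid-body transform: p_cam = R * p_lidar + t
    cam = [
        rotation[i][0] * x + rotation[i][1] * y + rotation[i][2] * z + translation[i]
        for i in range(3)
    ]
    if cam[2] <= 0:
        return None  # behind or on the image plane: not visible
    # Pinhole projection (assumed camera model, no distortion)
    u = fx * cam[0] / cam[2] + cx
    v = fy * cam[1] / cam[2] + cy
    return u, v, cam[2]


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pixel = project_point((1.0, 2.0, 10.0), identity, [0.0, 0.0, 0.0], 100.0, 100.0, 50.0, 40.0)
```

If the extrinsic parameters are wrong, the projected pixel lands on the wrong image location, which is the misalignment that calibration is meant to remove.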

In this work, we develop a new deep learning-driven technique for accurate calibration of a LiDAR-Camera pair. The technique is completely data-driven, does not require any specific calibration targets or hardware assistance, and runs end to end and fully automatically. We use a deep neural network to accurately align the LiDAR point cloud to the image and to regress the 6DoF extrinsic calibration parameters. Geometric supervision and transformation supervision guide the learning process to maximize the consistency between the input images and point clouds. Given LiDAR-Camera pairs as the training dataset, the system automatically learns meaningful features, infers cross-modal correlations, and estimates the accurate 6DoF rigid-body transformation between the 3D LiDAR and the 2D image in real time.
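As a minimal sketch of what a 6DoF rigid-body transformation looks like in code, the example below converts three rotation angles and three translations into a 4x4 homogeneous matrix and applies it to a point. The parameterization is an assumption made for illustration: roll, pitch, and yaw in radians, composed as Rz(yaw) · Ry(pitch) · Rx(roll). The network described above may represent rotation differently, and all function names here are hypothetical.

```python
import math


def rotation_from_euler(roll, pitch, yaw):
    """Return the 3x3 rotation Rz(yaw) * Ry(pitch) * Rx(roll) (assumed convention)."""
    c_r, s_r = math.cos(roll), math.sin(roll)
    c_p, s_p = math.cos(pitch), math.sin(pitch)
    c_y, s_y = math.cos(yaw), math.sin(yaw)
    return [
        [c_y * c_p, c_y * s_p * s_r - s_y * c_r, c_y * s_p * c_r + s_y * s_r],
        [s_y * c_p, s_y * s_p * s_r + c_y * c_r, s_y * s_p * c_r - c_y * s_r],
        [-s_p, c_p * s_r, c_p * c_r],
    ]


def transform_from_6dof(params):
    """Build a 4x4 homogeneous transform from (roll, pitch, yaw, tx, ty, tz)."""
    roll, pitch, yaw, tx, ty, tz = params
    r = rotation_from_euler(roll, pitch, yaw)
    return [r[0] + [tx], r[1] + [ty], r[2] + [tz], [0.0, 0.0, 0.0, 1.0]]


def apply_transform(transform, point):
    """Apply a 4x4 homogeneous transform to a 3D point."""
    x, y, z = point
    return tuple(row[0] * x + row[1] * y + row[2] * z + row[3] for row in transform[:3])


# A 90-degree yaw followed by a 1 m shift along x (hypothetical values).
T = transform_from_6dof((0.0, 0.0, math.pi / 2, 1.0, 0.0, 0.0))
moved = apply_transform(T, (1.0, 0.0, 0.0))
```

Six numbers fully determine the transform, which is why the calibration can be posed as a regression of a six-element output.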

The images in the slides show the system overview and the experimental results. In the experimental results, the background is the corresponding RGB image, and the transparent colormap is the depth map, with colors from blue to red corresponding to small to large distances. The first row shows the input RGB images, the second row the input depth maps, the third row the predicted depth maps, and the fourth row the ground-truth depth maps; each depth map is overlaid onto the RGB image. The red rectangles in the second row mark the misalignment between the input depth maps and the RGB images, and those in the third row mark the accurate alignment between the predicted depth maps and the RGB images.
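The overlay described above can be approximated with the following sketch, which maps a depth value to a color between blue (near) and red (far) and blends it with an RGB pixel. The linear two-color ramp, the `alpha` blending weight, and the function names are assumptions for illustration. The colormap actually used to render the figures is not specified here.

```python
def depth_to_color(depth, min_depth, max_depth):
    """Map a depth to an (R, G, B) color: blue at min_depth, red at max_depth."""
    span = max_depth - min_depth
    t = 0.0 if span <= 0 else (depth - min_depth) / span
    t = min(1.0, max(0.0, t))  # clamp depths outside the range
    return (round(255 * t), 0, round(255 * (1 - t)))


def blend(pixel, color, alpha=0.5):
    """Alpha-blend a depth color over an RGB pixel (alpha=1 shows only the color)."""
    return tuple(round((1 - alpha) * p + alpha * c) for p, c in zip(pixel, color))


near = depth_to_color(0.0, 0.0, 80.0)   # pure blue
far = depth_to_color(80.0, 0.0, 80.0)   # pure red
overlaid = blend((100, 100, 100), far, alpha=1.0)
```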


[1] Zhao, G., Hu, J., You, S., and Kuo, C.-C. J., “CalibDNN: Multimodal Sensor Calibration for Perception Using Deep Neural Networks,” Proc. SPIE, 11756-46 (2021)