TY - JOUR
T1 - Joint Camera-LiDAR Scene Synthesis and Perception for Autonomous Driving
AU - Raghavendra, S.
AU - Abhilash, S. K.
AU - Madhav Nookala, Venu
AU - Arun Kumar, P. V.
AU - Anoop, B. N.
AU - Ramyashree
N1 - Publisher Copyright:
© 2013 IEEE.
PY - 2025
Y1 - 2025
N2 - The advancement of autonomous driving and embedded AI systems has intensified the need for large-scale, richly annotated multimodal datasets encompassing RGB images, semantic labels, and 3D LiDAR data. Manual collection and annotation of such datasets remain costly and time-consuming, especially when temporal and cross-modal consistency is required. This work introduces Joint Camera-LiDAR Scene Synthesis and Perception (JCLSP), a unified generative framework that simultaneously synthesizes photorealistic RGB images, semantic segmentation maps, and LiDAR range images through a compact, optimized diffusion process. Unlike prior approaches that employ separate diffusion branches, JCLSP fuses the image and LiDAR modalities early in the pipeline and leverages a shared latent space for coherent multimodal generation. The architecture integrates three key elements: BKSDM, which streamlines the diffusion process by eliminating redundant blocks; a joint image-LiDAR diffusion module that applies the BKSDM framework to enable depth-aware synthesis with geometric fidelity; and modality-specific decoders that extract semantic masks, LiDAR range images, and image scenes from the shared latent representation. Experimental results on synthetic datasets indicate that JCLSP captures meaningful cross-modal correlations and preserves spatial features. By jointly generating camera and LiDAR views along with semantic segmentation annotations, the method demonstrates promising potential for cross-modal representation learning with labeled data.
AB - The advancement of autonomous driving and embedded AI systems has intensified the need for large-scale, richly annotated multimodal datasets encompassing RGB images, semantic labels, and 3D LiDAR data. Manual collection and annotation of such datasets remain costly and time-consuming, especially when temporal and cross-modal consistency is required. This work introduces Joint Camera-LiDAR Scene Synthesis and Perception (JCLSP), a unified generative framework that simultaneously synthesizes photorealistic RGB images, semantic segmentation maps, and LiDAR range images through a compact, optimized diffusion process. Unlike prior approaches that employ separate diffusion branches, JCLSP fuses the image and LiDAR modalities early in the pipeline and leverages a shared latent space for coherent multimodal generation. The architecture integrates three key elements: BKSDM, which streamlines the diffusion process by eliminating redundant blocks; a joint image-LiDAR diffusion module that applies the BKSDM framework to enable depth-aware synthesis with geometric fidelity; and modality-specific decoders that extract semantic masks, LiDAR range images, and image scenes from the shared latent representation. Experimental results on synthetic datasets indicate that JCLSP captures meaningful cross-modal correlations and preserves spatial features. By jointly generating camera and LiDAR views along with semantic segmentation annotations, the method demonstrates promising potential for cross-modal representation learning with labeled data.
UR - https://www.scopus.com/pages/publications/105017151120
U2 - 10.1109/ACCESS.2025.3613054
DO - 10.1109/ACCESS.2025.3613054
M3 - Article
AN - SCOPUS:105017151120
SN - 2169-3536
VL - 13
SP - 166740
EP - 166759
JO - IEEE Access
JF - IEEE Access
ER -