TY - JOUR
T1 - ETIA: Enhancing Text2Image surround view scene generation with semantic annotation via diffusion for autonomous driving
T2 - IEEE Access
AU - Ramyashree,
AU - Raghavendra, S.
AU - Abhilash, S. K.
AU - Nookala, Venu Madhav
AU - Kumar, Arun
AU - Malashree, P.
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - Generating high-fidelity surround view images from text prompts is a complex task that requires balancing contextual coherence with computational efficiency. The proposed work introduces a novel methodology that combines recurrent attention-based encoder-decoder architectures with text-to-image diffusion models to produce coherent and continuous surround view images. The approach utilizes a custom text encoder to convert input text prompts into contextual embeddings, which are then processed by the proposed ViewNet Unet2d architecture within the decoder. This architecture employs dual cross-attention mechanisms: one aligns text embeddings with the corresponding noise image latents, while the other integrates previously generated image latents to ensure continuity across the sequence. This method ensures that each generated image adheres to its specific prompt while maintaining coherence with preceding images. In addition, an annotation decoder is introduced that generates semantic segmentation maps, instance segmentation masks, and object detection annotations. The annotation decoder processes latent image maps using a shared feature extraction backbone and dedicated heads for each annotation task. Experimental results on the nuScenes validation set demonstrate the effectiveness of the proposed model in producing high-quality, contextually aligned surround view images. The proposed model achieves an FVD of 99 and an FID of 12.6, outperforming existing methods such as Panacea+ and DriveDreamer-2. Furthermore, the approach improves segmentation and detection accuracy, achieving a PQ of 67.4, an mIoU of 80.1, and an mAP of 65.4, surpassing methods such as OpenSeeD and D2Det. An ablation study highlights the contributions of key components of the architecture: integrating positional encoding, self-attention, and concurrent attention significantly enhances generation quality, reducing FVD to 99 and FID to 12.6. Experimental results demonstrate the effectiveness of the proposed work in producing high-quality, contextually aligned surround view images with comprehensive annotations, pushing the boundaries of text-to-image synthesis and scene understanding.
AB - Generating high-fidelity surround view images from text prompts is a complex task that requires balancing contextual coherence with computational efficiency. The proposed work introduces a novel methodology that combines recurrent attention-based encoder-decoder architectures with text-to-image diffusion models to produce coherent and continuous surround view images. The approach utilizes a custom text encoder to convert input text prompts into contextual embeddings, which are then processed by the proposed ViewNet Unet2d architecture within the decoder. This architecture employs dual cross-attention mechanisms: one aligns text embeddings with the corresponding noise image latents, while the other integrates previously generated image latents to ensure continuity across the sequence. This method ensures that each generated image adheres to its specific prompt while maintaining coherence with preceding images. In addition, an annotation decoder is introduced that generates semantic segmentation maps, instance segmentation masks, and object detection annotations. The annotation decoder processes latent image maps using a shared feature extraction backbone and dedicated heads for each annotation task. Experimental results on the nuScenes validation set demonstrate the effectiveness of the proposed model in producing high-quality, contextually aligned surround view images. The proposed model achieves an FVD of 99 and an FID of 12.6, outperforming existing methods such as Panacea+ and DriveDreamer-2. Furthermore, the approach improves segmentation and detection accuracy, achieving a PQ of 67.4, an mIoU of 80.1, and an mAP of 65.4, surpassing methods such as OpenSeeD and D2Det. An ablation study highlights the contributions of key components of the architecture: integrating positional encoding, self-attention, and concurrent attention significantly enhances generation quality, reducing FVD to 99 and FID to 12.6. Experimental results demonstrate the effectiveness of the proposed work in producing high-quality, contextually aligned surround view images with comprehensive annotations, pushing the boundaries of text-to-image synthesis and scene understanding.
UR - https://www.scopus.com/pages/publications/105011655366
U2 - 10.1109/ACCESS.2025.3591146
DO - 10.1109/ACCESS.2025.3591146
M3 - Article
AN - SCOPUS:105011655366
SN - 2169-3536
VL - 13
SP - 132209
EP - 132222
JO - IEEE Access
JF - IEEE Access
ER -