Background
3D simulation renders provide a way to expand computer vision image datasets without real-world photography. However, developing realistic 3D environments requires significant time and skill: seemingly simple tasks, such as converting a daytime scene to nighttime lighting, pose challenges for individuals without formal training in 3D engines. Recent advances in AI image generation offer text-prompted editing that can visually improve and alter existing images. Diffusion models can perform a day-to-night transformation within minutes, reducing the time and effort such tasks require.
Approach
This project aimed to leverage diffusion-based image models to reduce the time and effort required to create high-quality simulated images. Two variations of a city street were configured in Unreal Engine, shown in Figure 1. One contained high-quality 3D models and lighting (collectively referred to as assets) developed by professional 3D artists; the second contained simple assets representative of entry-level 3D development skills. During render sequences, both variations randomly configured the vehicles occupying the scene’s parking spots and the camera angles, and produced segmentation masks of the vehicles.
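As a rough illustration, this kind of per-render randomization can be scripted through Unreal's editor Python API. The sketch below uses hypothetical asset paths, parking-spot coordinates, and fill ratio, not the project's actual scene configuration; camera angles would be randomized in a similar fashion.

```python
# Minimal Unreal editor-scripting sketch of per-render scene randomization.
# Asset paths, spot layout, and fill ratio are illustrative placeholders.
import random
import unreal

# Hypothetical vehicle blueprints and parking-spot locations for the scene.
VEHICLE_ASSETS = [
    "/Game/Vehicles/BP_Sedan",
    "/Game/Vehicles/BP_Pickup",
    "/Game/Vehicles/BP_Hatchback",
]
PARKING_SPOTS = [unreal.Vector(x * 400.0, 0.0, 0.0) for x in range(8)]

def randomize_scene(fill_ratio=0.6):
    """Spawn a random vehicle in a random subset of parking spots."""
    occupied = random.sample(PARKING_SPOTS, int(len(PARKING_SPOTS) * fill_ratio))
    for spot in occupied:
        asset = unreal.EditorAssetLibrary.load_asset(random.choice(VEHICLE_ASSETS))
        # Small yaw jitter so parked vehicles are not perfectly aligned.
        rotation = unreal.Rotator(roll=0.0, pitch=0.0, yaw=random.uniform(-5.0, 5.0))
        unreal.EditorLevelLibrary.spawn_actor_from_object(asset, spot, rotation)

randomize_scene()
```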
Scene renders were passed through Stable Diffusion XL (SDXL) to apply weather and lighting changes and improvements. ControlNets were used to retain important image details such as the street layout and the positions and shapes of the vehicles. The resulting images and segmentation masks were compiled into two datasets to compare transformations of high-quality versus low-quality 3D environment scenes. Tests measured the performance of YOLOv11 object detection models trained on the high-quality and low-quality datasets, validated against real-world images from the ActiveVision™ dataset. These results were compared to the baseline of a YOLOv11 model trained on COCO.
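The SDXL + ControlNet transformation step can be approximated with the Hugging Face diffusers library. In the sketch below, the ControlNet checkpoint, prompt, and conditioning strength are illustrative assumptions rather than the project's exact settings; a depth-conditioned ControlNet is one common way to preserve scene geometry.

```python
# Sketch of an SDXL + ControlNet pass over an engine render, assuming the
# Hugging Face diffusers library; model IDs and prompt are illustrative.
import torch
from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline
from diffusers.utils import load_image

# Depth conditioning helps retain street layout and vehicle position/shape.
controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-depth-sdxl-1.0", torch_dtype=torch.float16
)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Conditioning image: a depth pass exported alongside the color render.
depth = load_image("renders/parking_lot_0001_depth.png")
image = pipe(
    prompt="city street at night, wet asphalt, photorealistic",
    image=depth,
    controlnet_conditioning_scale=0.8,  # how strongly geometry constrains output
).images[0]
image.save("transformed/parking_lot_0001_night.png")
```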
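Training and validation follow the standard Ultralytics workflow; the dataset YAML files, checkpoint name, and hyperparameters below are placeholders, not the project's recorded configuration.

```python
# Sketch of the detector training/validation loop using Ultralytics.
from ultralytics import YOLO

model = YOLO("yolo11n.pt")  # pretrained YOLOv11 checkpoint

# Fine-tune on one of the SDXL-transformed simulated datasets.
model.train(data="sim_high_quality.yaml", epochs=50, imgsz=640)

# Validate against real-world imagery to measure sim-to-real transfer.
metrics = model.val(data="activevision.yaml")
print(metrics.box.map50)  # mAP at IoU threshold 0.50
```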
Accomplishments
SDXL successfully transformed and improved the lighting, weather, and style of the 3D simulation renders, as seen in Figure 2 and Figure 3. The COCO-trained YOLOv11 model detected simulated vehicles with 72% accuracy, slightly higher than its 69.8% accuracy on real vehicles in the ActiveVision dataset. The models trained on the two simulated datasets dropped to 18.6% and 14.6% accuracy, but the cause was traced to compression artifacts in the segmentation masks. Changing the mask format will improve label quality and increase accuracy. These successes pave the way for further improvement of simulated image datasets.
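A minimal sketch of the proposed mask-format fix, assuming the masks were exported with lossy compression; the file paths and threshold value are illustrative. Lossy codecs blur mask boundaries into intermediate gray values, so thresholding and re-encoding losslessly gives downstream label extraction exact pixel values.

```python
# Sketch: clean compression artifacts from a mask and re-encode losslessly.
import cv2

mask = cv2.imread("masks/parking_lot_0001.jpg", cv2.IMREAD_GRAYSCALE)

# Thresholding restores a clean binary vehicle/background separation.
_, clean = cv2.threshold(mask, 127, 255, cv2.THRESH_BINARY)

# PNG is lossless, so label generation sees exact mask values.
cv2.imwrite("masks/parking_lot_0001.png", clean)
```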
Figure 1: High-quality (left) and low-quality (right) versions of the simulated scene.
Figure 2: Low-angle camera view of scene transformed by SDXL.
Figure 3: High-angle camera view of scene transformed by SDXL.