Thermal and RGB-Aware Vision Language Model for Built Environment with 3D Scene Rendering
Buildings account for nearly 40% of U.S. energy consumption, making effective auditing and retrofitting of building envelopes essential to reducing energy use. However, traditional energy audits remain costly, labor-intensive, and difficult to scale across diverse building types. This paper proposes a novel workflow that leverages multimodal AI to prepare structured building energy modeling (BEM) inputs directly from close-range RGB and infrared images. The workflow integrates three components: 1) close-range image feature fusion using CapsLab-based thermal anomaly segmentation and feature matching; 2) Neural Radiance Fields (NeRF) for rendering full-scale, photorealistic façades from fragmented image inputs; and 3) LLaVA prompt engineering, guided by vision-related variables screened from the U.S. Energy Information Administration's 2018 Commercial Buildings Energy Consumption Survey (CBECS) codebook, to extract standardized enclosure properties in structured formats. A pilot study of the D.M. Smith Building on the Georgia Tech campus demonstrates that the proposed pipeline accurately identifies attributes such as wall construction type, roof material, and number of stories while remaining consistent with energy auditing standards. This study highlights the potential of AI-assisted workflows to automate key aspects of building audits, reduce labor costs, and generate reliable, simulation-ready data for tools such as EnergyPlus.