
Michael Rubloff
Jan 23, 2026
The SpatialView team has published new work on integrating Large Multimodal Models with Gaussian Splatting to enable natural language search within 3D reconstructions. The approach transforms passive visual models into queryable spatial databases, allowing users to locate objects and assets through conversational queries rather than manual tagging.
Traditional 3D reconstruction pipelines produce high-fidelity visual representations but lack semantic understanding. While photogrammetry and radiance field representations such as NeRFs and Gaussian Splatting excel at capturing geometry and appearance, they provide no native mechanism for answering questions like "where are the access panels" or "show me all equipment near the cooling system." SpatialView's workflow addresses this gap by maintaining the 2D image dataset as a semantic interface rather than attempting to embed meaning directly into 3D geometry.
The system operates in four stages. First, Structure from Motion or SLAM establishes camera poses and generates the 3D model. Second, embedding models create a semantic index over the entire image collection, allowing vector similarity search to filter candidate frames based on text queries. Third, Large Multimodal Models reason over the filtered images to identify unique object instances and produce 2D grounding coordinates. Finally, raycasting projects these pixel coordinates into 3D space, creating persistent annotations at precise locations within the reconstructed scene.
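To make the pipeline concrete, here is a minimal Python sketch of stages two through four. SpatialView has not published implementation details, so everything below is illustrative: it assumes a CLIP-style model that embeds images and text into a shared vector space, leaves the LMM grounding call as a placeholder, and uses a simple pinhole camera model for the raycasting step.

```python
import numpy as np

# Stage 2: semantic index over the image collection. Assumes a
# CLIP-style model embedding images and text into a shared space;
# the wrappers that produce these embeddings are hypothetical.

def build_index(image_embeddings: np.ndarray) -> np.ndarray:
    """L2-normalize embeddings so a dot product equals cosine similarity."""
    return image_embeddings / np.linalg.norm(image_embeddings, axis=1, keepdims=True)

def filter_frames(index: np.ndarray, query_vec: np.ndarray, top_k: int = 20) -> np.ndarray:
    """Vector similarity search: return indices of the top_k candidate frames."""
    scores = index @ (query_vec / np.linalg.norm(query_vec))
    return np.argsort(scores)[::-1][:top_k]

# Stage 3 would pass the candidate frames to an LMM, which identifies
# unique object instances and returns 2D pixel coordinates (grounding).

# Stage 4: raycast a grounded pixel into the scene. With intrinsics K
# and a camera-to-world pose (R, t), the pixel defines a world-space
# ray; intersecting it with the reconstructed geometry (e.g. rendered
# splat depth) yields a persistent 3D anchor point.
def pixel_to_ray(u: float, v: float, K: np.ndarray, R: np.ndarray, t: np.ndarray):
    d_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])  # back-project pixel
    d_world = R @ d_cam                               # rotate into world frame
    return t, d_world / np.linalg.norm(d_world)       # ray origin, unit direction
```

The design point worth noting is that the semantics live in the 2D image set rather than in the splat geometry itself; the 3D reconstruction only supplies camera poses and depth for anchoring the results.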
This architecture enables several practical capabilities. Inventory auditing that previously required days of manual annotation can now be completed in minutes through simple text queries. Users can ask contextual questions like "find all access panels left open" or "locate equipment within two meters of the primary cooling line" without training custom models. Maintenance records, inspection histories, and operating manuals can be anchored directly to objects in their spatial context, turning the 3D model into a navigable knowledge base. As new imagery is captured, the system can refresh the index automatically, keeping digital twins synchronized with real-world changes.
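As a brief illustration of how a distance-based query like "within two meters" can work once annotations carry 3D positions, consider the sketch below; the Annotation schema is invented for the example and is not SpatialView's data model.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Annotation:
    label: str
    position: np.ndarray  # 3D anchor point produced by the raycasting stage

def within_radius(annotations: list[Annotation], anchor: np.ndarray, radius_m: float = 2.0):
    # e.g. "equipment within two meters of the primary cooling line":
    # once objects are anchored in 3D, the query reduces to plain geometry.
    return [a for a in annotations if np.linalg.norm(a.position - anchor) <= radius_m]
```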
The approach combines AI's visual reasoning capabilities with the spatial precision of 3D reconstruction, enabling environments to be visualized, understood, and queried programmatically.
Learn more about SpatialView's approach to semantic 3D search on their blog.