It's no secret at all that I am a massive fan of pairing radiance fields with semantics. I've written about it multiple times and always get excited when I see a new method. This brings me to last night when Foundation Model Embedded Gaussian Splatting (FMGS) came across my screen.
For those not familiar, semantic paired methods literally allow you to ask a radiance field a question about what's contained inside it. It can be a simple question like, where did I leave my olive oil in the kitchen, or what screw do I use first to assemble this desk I got from IKEA?
There was actually another Gaussian Splatting based method just last week named LangSplat, but this newest one also caught my attention. LangSplat offers a 200X speedup compared to LERF, but FMGS...FMGS is 800X faster than LERF! They report running at 103.4 fps. Admittedly, the authors have not released much media or many examples that I can show, though I would very much like to.
Returning to LangSplat, it seems to be strong at utilizing SAM, and I am super curious how the two new papers could be layered on top of one another. FMGS stands out for integrating vision-language embeddings from foundation models directly into the 3D scene representation, merging visual and linguistic data effectively. LangSplat takes a slightly different approach, constructing a 3D language field by augmenting each Gaussian with language embeddings distilled from CLIP and using a tile-based splatting technique to render the language features.
How is FMGS getting such a ridiculous speed boost? It feels like so much of the work people do can be tied back to NVIDIA's Multi Resolution Hash Encoding, or Instant NGP. They're not actually using Instant NGP itself, because that is NeRF based, but they take direct inspiration from it. In FMGS, the speed boost comes from integrating multi-resolution hash encoding into the framework.
The distinguishing feature of FMGS is its integration of vision-language embeddings from foundation models. These embeddings are incorporated into the 3D scene representation, enabling the model to understand and interpret the semantic content within the scene. In practice, this involves distilling feature maps generated from image-based foundation models and rendering them from the 3D GS model, effectively merging visual and linguistic data.
While we've seen various efforts aimed at optimizing Gaussian splatting, FMGS introduces a unique solution to the challenge. To navigate the memory and computational constraints often encountered, FMGS leverages a Multi-Resolution Hash Encoding (MHE). This method works in tandem with Gaussian Splatting, enhancing its ability to efficiently represent complex language content within 3D scenes.
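To make the idea a bit more concrete, here is a rough Python/NumPy sketch of a multi-resolution hash lookup in the style of Instant NGP. This is not the FMGS code: the function name hash_encode, the base resolution, and the growth factor are my own illustrative choices, and I use a nearest-vertex lookup instead of interpolating between grid corners to keep it short.

```python
import numpy as np

# Per-axis multipliers from the Instant NGP spatial hash (the first is 1).
PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)

def hash_encode(points, tables, base_res=16, growth=2.0):
    """Look up one feature vector per point at every resolution level.

    points: (N, 3) array with coordinates in [0, 1).
    tables: list of (T, F) arrays, one hash table per level (illustrative).
    Returns an (N, levels * F) array of concatenated features.
    """
    feats = []
    for level, table in enumerate(tables):
        # Grid resolution grows geometrically from coarse to fine.
        res = int(base_res * growth ** level)
        # Integer grid cell containing each point at this resolution.
        cell = np.floor(points * res).astype(np.uint64)
        # Multiply each axis by its prime, XOR together, wrap to table size.
        h = cell * PRIMES
        idx = (h[:, 0] ^ h[:, 1] ^ h[:, 2]) % np.uint64(table.shape[0])
        feats.append(table[idx.astype(np.int64)])
    return np.concatenate(feats, axis=1)
```

The appeal is that memory is bounded by the table size rather than by the scene or the number of Gaussians, so a compact set of tables can be queried at any 3D position, for example a Gaussian's center.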
The MHE uses hash tables at multiple resolutions, reducing the computational load while maintaining the quality of the semantic embeddings. Another key innovation in FMGS is the introduction of a pixel alignment loss. This loss ensures that the rendered feature distance of semantically similar entities is minimized, adhering to pixel-level semantic boundaries. Together, these pieces contribute to the framework's ability to provide high-quality rendering and fast training, both crucial for practical applications.
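Here is how I picture a loss of that kind, as a heavily simplified NumPy sketch. It is my own reading, not the paper's exact formulation: I assume there is a per-pixel reference feature map (for example from a self-supervised image model) that tells us which neighboring pixels belong together, and the names pixel_alignment_loss, rendered, and reference are hypothetical.

```python
import numpy as np

def unit(x, eps=1e-8):
    """Scale vectors along the last axis to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def pixel_alignment_loss(rendered, reference):
    """Pull rendered features of neighbouring pixels together when the
    reference features say the two pixels are semantically similar.

    rendered:  (H, W, D1) feature map rendered from the 3D model.
    reference: (H, W, D2) per-pixel reference features (an assumption).
    """
    r, g = unit(rendered), unit(reference)
    H, W = r.shape[:2]
    offsets = [(0, 1), (1, 0)]  # right neighbour, bottom neighbour
    total = 0.0
    for dy, dx in offsets:
        # Cosine similarity between each pixel and its neighbour.
        sim_r = (r[: H - dy, : W - dx] * r[dy:, dx:]).sum(-1)
        sim_g = (g[: H - dy, : W - dx] * g[dy:, dx:]).sum(-1)
        # Pairs that straddle a semantic boundary get little or no weight.
        weight = np.clip(sim_g, 0.0, None)
        total += np.mean(weight * (1.0 - sim_r))
    return total / len(offsets)
```

The weighting is what keeps the loss from blurring features across object edges: pairs of pixels that look unrelated in the reference features contribute almost nothing.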
FMGS employs a unique training procedure that involves supervising the MHE-based language feature field using a hybrid feature map. This map is derived from multi-scale image crops obtained from various viewpoints. The training process ensures that the language embeddings capture relevant features at each scale, allowing for a comprehensive representation of the scene.
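As a rough illustration of what a hybrid, multi-scale feature map could look like for a single training view, here is a NumPy sketch. Everything here is an assumption on my part: encode_crop is a stand-in for an image encoder such as CLIP, and the scales, the half-crop stride, and the plain averaging across scales are illustrative choices rather than the authors' settings.

```python
import numpy as np

def hybrid_feature_map(image, encode_crop, dim, scales=(0.25, 0.5, 1.0)):
    """Average crop embeddings over several crop sizes into one per-pixel map.

    image:       (H, W, C) array.
    encode_crop: callable mapping a crop to a (dim,) embedding (hypothetical
                 stand-in for a foundation-model image encoder).
    """
    H, W = image.shape[:2]
    total = np.zeros((H, W, dim))
    for s in scales:
        ch, cw = max(1, int(H * s)), max(1, int(W * s))
        step_y, step_x = max(1, ch // 2), max(1, cw // 2)
        acc = np.zeros((H, W, dim))
        cnt = np.zeros((H, W, 1))
        # Slide overlapping crops over the image at this scale.
        for y in range(0, H - ch + 1, step_y):
            for x in range(0, W - cw + 1, step_x):
                emb = np.asarray(encode_crop(image[y:y + ch, x:x + cw]))
                acc[y:y + ch, x:x + cw] += emb
                cnt[y:y + ch, x:x + cw] += 1
        # Pixels no crop reached at this scale stay zero instead of dividing by 0.
        total += acc / np.maximum(cnt, 1)
    return total / len(scales)
```

With a toy encoder like lambda crop: crop.reshape(-1, 3).mean(axis=0) and dim=3, this returns a map where each pixel blends the embeddings of every crop that covers it, small and large alike.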
For querying, FMGS allows users to interact with the 3D scene using natural language. The model generates relevancy maps based on the query, highlighting semantically relevant parts of the scene.
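A relevancy map like that can be sketched in a few lines of NumPy. The scoring below is in the spirit of LERF's pairwise softmax against generic "canonical" phrases; whether FMGS uses exactly this is an assumption on my part, and the embeddings are expected to come from a text encoder that is not shown here.

```python
import numpy as np

def unit(x, eps=1e-8):
    """Scale vectors along the last axis to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def relevancy_map(feature_map, query_emb, canon_embs):
    """Score every pixel of a rendered feature map against a text query.

    feature_map: (H, W, D) rendered language features.
    query_emb:   (D,) embedding of the user's query.
    canon_embs:  (K, D) embeddings of generic phrases such as "object".
    Returns an (H, W) map with values in (0, 1).
    """
    f, q, c = unit(feature_map), unit(query_emb), unit(canon_embs)
    sim_q = np.exp(f @ q)[..., None]  # (H, W, 1) query similarity
    sim_c = np.exp(f @ c.T)           # (H, W, K) canonical similarities
    # Pairwise softmax of the query against each canonical phrase,
    # keeping the least favourable comparison per pixel.
    return (sim_q / (sim_q + sim_c)).min(axis=-1)
```

Pixels scoring above 0.5 are the ones where the query beats every generic phrase, which is what you would highlight when asking where the olive oil is.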
Unlike traditional methods that focus either on geometric accuracy or semantic understanding, FMGS excels in both. It provides a more holistic understanding of the scene by integrating detailed geometry with rich semantic context. Additionally, FMGS demonstrates a significant improvement in inference speed and versatility compared to other state-of-the-art methods.
FMGS opens up a plethora of possibilities in augmented reality and robotics. In AR, it can enhance user experiences by providing more accurate and interactive representations of physical spaces. In robotics, FMGS can be instrumental in developing robots that understand and navigate spaces more effectively, recognizing objects not just by their shape but also by their semantic properties.
Funnily enough, in order to not go insane out of boredom in the days between Christmas and New Year's, I had a long phone call with a friend, and it clicked for him how many opportunities there are for this. Some of the ones we spoke about were hospital and patient SOP management, evacuation and simulation methods, and a grocery store automating inventory. Not far from that, some of my personal favorites are in the agricultural space.
Given that FMGS comes out of Google, I have to imagine they are thinking about how it might benefit search. I would be curious to see how a user of Google Maps might use FMGS. My thoughts go to more everyday uses, such as asking, where is the bathroom in this coffee shop?
Think about all the possibilities of what you can do with radiance fields paired with semantics. What do you think? How would you use a radiance field that can highlight what's contained in it?
The authors have also stated that they will release their code after the paper has been accepted.