With the rise of editable NeRF methods last year, such as Instruct NeRF2NeRF and InpaintNeRF 360, the industry has been eagerly anticipating advancements in 3D scene editing. One promised goal has been the creation of ControlNet-style editable NeRFs.
Just like how ControlNet revolutionized control over prompting with Stable Diffusion, SigNeRF is primed to make NeRF inpainting way easier.
The method specifically uses an inpainting version of ControlNet with Stable Diffusion XL, using the SD WebUI API and the Diffusers library.
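If you want a feel for what that looks like in code, here's a minimal sketch of a depth-conditioned SDXL inpainting call through the Diffusers library. The model IDs, file names, and parameters here are my own illustrative assumptions, not anything pulled from the SigNeRF codebase:

```python
# Illustrative sketch: depth-conditioned SDXL inpainting via Diffusers.
# Model IDs, file names, and prompt are assumptions, not SigNeRF's code.
import torch
from diffusers import ControlNetModel, StableDiffusionXLControlNetInpaintPipeline
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-depth-sdxl-1.0", torch_dtype=torch.float16
)
pipe = StableDiffusionXLControlNetInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

color = load_image("render_color.png")   # rendered view of the NeRF scene
mask = load_image("render_mask.png")     # white where the edit should happen
depth = load_image("render_depth.png")   # depth map used as ControlNet conditioning

edited = pipe(
    prompt="a stone statue of a rabbit",
    image=color,
    mask_image=mask,
    control_image=depth,
    num_inference_steps=30,
).images[0]
edited.save("edited_view.png")
```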
For those not familiar with inpainting, it allows you to select a region of the displayed scene to be modified. We've been able to play with inpainting in 2D images for a while now, but what we're looking at here is the addition of photorealistic three-dimensional inpainting. While this is not the first method to tackle this problem, Scene Integrated Generation for Neural Radiance Fields, or SigNeRF, represents a significant (no pun intended) leap in this technology.
The first step involves creating a foundational NeRF scene, which is the starting point for any subsequent editing. A set of input images, along with their corresponding camera parameters, is used to reconstruct a NeRF scene. This reconstruction forms a 3D representation of the original, unedited scene.
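If you haven't looked at NeRF training data before, those camera parameters are typically stored per image in something like nerfstudio's transforms.json format. Here's a rough sketch of that structure, with made-up values:

```python
# Illustrative sketch of nerfstudio-style camera parameters; values are made up.
import json

transforms = {
    "fl_x": 1200.0, "fl_y": 1200.0,  # focal lengths in pixels
    "cx": 640.0, "cy": 360.0,        # principal point
    "w": 1280, "h": 720,             # image resolution
    "frames": [
        {
            "file_path": "images/frame_0001.png",
            # 4x4 camera-to-world transform for this view
            "transform_matrix": [[1, 0, 0, 0],
                                 [0, 1, 0, 0],
                                 [0, 0, 1, 2.0],
                                 [0, 0, 0, 1]],
        },
        # ... one entry per input image
    ],
}

with open("transforms.json", "w") as f:
    json.dump(transforms, f, indent=2)
```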
Rather than processing each frame through ControlNet independently, SigNeRF combines the frames, along with their depth maps and masks, into a single grid for one unified ControlNet pass, which results in significantly stronger, more consistent inpainting.
What I mean by that is that it's difficult to guarantee that any individual frame, on its own, will produce a clean output. By introducing continuity across frames, akin to a rising tide lifting all ships, a much more robust 'Reference Sheet Generation' is achieved. Importantly, one grid cell in this sheet is intentionally left empty.
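To make the idea of a reference sheet concrete, here's a rough sketch of tiling rendered views into a grid while reserving one empty cell. The grid size and helper names are my own assumptions, not the authors' code:

```python
# Illustrative sketch: tile rendered views into a reference sheet,
# leaving the final grid cell empty for later per-view updates.
from PIL import Image

def build_reference_sheet(views, cell_size=(512, 512), cols=3, rows=2):
    """Tile rendered views into a grid; the last cell is intentionally left blank."""
    w, h = cell_size
    sheet = Image.new("RGB", (cols * w, rows * h), color=(0, 0, 0))
    for i, view in enumerate(views[: cols * rows - 1]):  # reserve the final cell
        col, row = i % cols, i // cols
        sheet.paste(view.resize(cell_size), (col * w, row * h))
    return sheet
```

The same tiling would be applied to the depth maps and masks, so the color, depth, and mask sheets line up cell for cell when they go through ControlNet together.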
However, there's more to it than just creating a large reference sheet. This is where the next part of SigNeRF comes in: Reference Sheet Updates.
Now we have the generated reference sheet, with its corresponding depth maps and inpainted edits, and that empty slot just sitting there. You may have seen this coming, but it's intentional and is about to be put to work.
Each image in the NeRF dataset is updated individually, a process that involves inserting the color, depth, and mask image of each specific camera view into the empty slot of the reference sheet. This step cleverly aligns each camera view with the reference sheet's edits, transferring the sheet's consistency to each updated image. The ControlNet model then processes these combined grids, ensuring that the edits maintain 3D coherence across the scene. This step is repeated for all camera views in the NeRF dataset.
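Conceptually, that per-view loop looks something like the sketch below, where `pipe` is the same kind of Diffusers pipeline shown earlier. The grid geometry and helper names are assumptions on my part:

```python
# Conceptual sketch of the per-view update: paste a camera view into the
# reserved empty cell, run the combined grid through the inpainting ControlNet,
# then crop the updated cell back out. Grid layout is an assumption.
def update_view(pipe, sheet_color, sheet_depth, sheet_mask,
                view_color, view_depth, view_mask,
                prompt, cell_size=(512, 512), empty_cell=(2, 1)):
    w, h = cell_size
    x, y = empty_cell[0] * w, empty_cell[1] * h

    # Insert this camera view (and its depth/mask) into the empty slot.
    sheet_color.paste(view_color.resize(cell_size), (x, y))
    sheet_depth.paste(view_depth.resize(cell_size), (x, y))
    sheet_mask.paste(view_mask.resize(cell_size), (x, y))

    # One ControlNet pass over the whole grid keeps the new cell consistent
    # with the already-edited reference cells.
    result = pipe(
        prompt=prompt,
        image=sheet_color,
        mask_image=sheet_mask,
        control_image=sheet_depth,
        num_inference_steps=30,
    ).images[0]

    # The updated image for this camera view is the contents of that cell.
    return result.crop((x, y, x + w, y + h))
```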
That's the baseline of how SigNeRF works for existing items, but what about generation? What if I want to have an object or animal in the scene that wasn't there previously?
The foundation of how it works begins the same as any NeRF capture, with the inclusion of a trained model; SigNeRF uses nerfstudio's Nerfacto. Then you tell SigNeRF what you would like to add (in the demo, it's a rabbit). A proxy object representing the desired addition (the rabbit) is placed within the original NeRF scene. This proxy doesn't need to be a detailed model of a rabbit; it can be a simple placeholder that defines the location, scale, and general shape where the rabbit will appear. This is actually how I capture scenes with no central focus or object in my personal radiance field library.
It's so important to have this proxy object because it establishes where in the 3D space of the scene the new object will be integrated. Virtual cameras are positioned around the proxy object to capture it from various angles. This step is essential to ensure the rabbit will appear consistent from different viewpoints in the final scene. For each of these camera views, images along with depth maps and masks are rendered. These renderings include the proxy object within the scene.
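To give a sense of what "positioning virtual cameras around the proxy" means, here's a small sketch that builds look-at camera poses on a circle around the proxy's position. The radius, height, and number of cameras are illustrative choices, not values from the paper:

```python
# Illustrative sketch: place virtual cameras on a circle around the proxy
# object and build look-at camera-to-world matrices for each of them.
import numpy as np

def look_at(eye, target, up=np.array([0.0, 0.0, 1.0])):
    """Camera-to-world matrix for a camera at `eye` looking toward `target`."""
    forward = target - eye
    forward /= np.linalg.norm(forward)
    right = np.cross(forward, up)
    right /= np.linalg.norm(right)
    true_up = np.cross(right, forward)
    c2w = np.eye(4)
    c2w[:3, 0] = right
    c2w[:3, 1] = true_up
    c2w[:3, 2] = -forward   # OpenGL-style convention: camera looks down -Z
    c2w[:3, 3] = eye
    return c2w

proxy_center = np.array([0.0, 0.0, 0.5])   # where the proxy (rabbit) sits
poses = []
for angle in np.linspace(0, 2 * np.pi, num=5, endpoint=False):
    eye = proxy_center + np.array([2.0 * np.cos(angle), 2.0 * np.sin(angle), 1.0])
    poses.append(look_at(eye, proxy_center))
# Each pose would then be rendered from the NeRF (color, depth, mask) for the sheet.
```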
Now we see the reintroduction of the two reference sheet methods. The rendered images from the previous step are assembled into a reference sheet in a grid layout. Remember, one grid cell is left empty. The reference sheet, including the proxy object images, is processed through ControlNet. Here, the proxy is replaced with the generative model of a rabbit. This step involves using a text prompt or other guiding input to direct the model to generate images of a rabbit in the place of the proxy object.
Each image in the original dataset is updated using the reference sheet. The image of each camera view is inserted into the empty slot of the reference sheet, and ControlNet generates a new version of this image, now including the rabbit in place of the proxy object. The key here is that ControlNet ensures the newly added object is consistent in appearance and lighting with the rest of the scene for each camera view, which was a previous challenge.
After all images in the dataset have been updated to include the rabbit, the final step is to re-train or fine-tune the original NeRF with this updated dataset. This step integrates the newly generated images, ensuring that the 3D representation of the scene now includes the rabbit as if it were always part of the original scene.
Re-training the NeRF takes longer than just fine-tuning it, so it's important to know when to fine-tune and when to retrain. The authors of SigNeRF have found that retraining yields better results when generating a new object, while fine-tuning is the better choice for edits to existing objects. They also recommend using only 5 images in the reference sheet to balance the trade-off between quality and compute overhead.
There are quite a few steps SigNeRF takes to function, but how long does that translate to for the end user? Somewhat surprisingly, they utilize the standard Nerfacto method from nerfstudio, trained to just 30K steps, which accounts for the original training time. The actual SigNeRF method adds
SigNeRF actually makes a ton of sense and I wouldn't be surprised if we see it being added into existing radiance field platforms, such as Luma. I would further be interested to see how the Luma AI team could utilize their text to 3D method, Genie, for inpainting in Luma captures. I'm not sure how difficult it would be for their team to implement, but given Matt Tancik is an author of both Instruct NeRF2NeRF and Nerfacto, I would be very curious.
If SigNeRF and its findings don't make it to their platform, I'm really excited that the SigNeRF GitHub has indicated that they will be making it available within nerfstudio after the code has been reviewed. There's no timeline associated with that, but it seems to indicate that the public at large will be able to experiment.
Things are happening fast. It's only January 4th! As we kick off the year, the pace of innovation in NeRF technology is exhilarating. SigNeRF is a testament to this rapid advancement, and its potential impact on 3D modeling and scene editing is immense.