When InstructNeRF2NeRF first came out, I knew it was the start of something incredible. I was so excited to see more and more papers with similar functionality. Today we get a new example of that with InpaintNeRF360.
In a nutshell, InpaintNeRF360 allows for the removal or editing of objects within an unbounded 360-degree NeRF. It uses a promptable segmentation model to let a user specify the part of the NeRF they would like to edit, and it ensures that the NeRF continues to look photorealistic after it's been edited by applying depth-space warping and perceptual priors.
One large benefit of InpaintNeRF360 is that it's able to tackle the full 360 degrees of an object that a user wants to edit, rather than, say, just one side of it. It's able to do this by leveraging the Segment Anything Model (SAM) to get a sense of what the specific object is. While this sounds straightforward, it gets a little trickier when you expand the request to the full shape of the object, so they use a pre-trained image inpainter across each of the images that contain a different view of the object. The output is then fine-tuned to ensure view consistency even across crazy camera angles and viewing perspectives.
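To make that segmentation step concrete, here's a minimal sketch of prompting SAM for an object mask in a single view using the public `segment_anything` library. The checkpoint path, image file, and click coordinates are placeholders of my own, so treat this as an illustration of the general workflow rather than the paper's actual code.

```python
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Load a SAM checkpoint (path is a placeholder) and wrap it in a predictor.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

# One training view of the scene; in practice this is repeated for every
# view that sees the object the user wants to edit.
image = cv2.cvtColor(cv2.imread("view_000.png"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# A single user click on the object serves as the prompt.
point = np.array([[512, 384]])   # (x, y) in pixels -- placeholder
label = np.array([1])            # 1 = foreground point

masks, scores, _ = predictor.predict(
    point_coords=point,
    point_labels=label,
    multimask_output=True,
)
best_mask = masks[np.argmax(scores)]  # boolean HxW mask for the prompted object
```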
InpaintNeRF360 encodes semantics into image space to allow for accurate object appearance editing.
NeRFs have proven to be excellent at showcasing photorealistic scenes; however, up to now, there has only been a small collection of papers, such as NeRFShop and InstructNeRF2NeRF, that allowed for removing or changing objects while preserving geometric and photometric consistency. As of publishing, this is the only method that is able to achieve 360-degree inpainting for a NeRF.
As part of their method, there are three primary steps to InpaintNeRF360. The first is grounding the text description into the scene through a model called Grounded Language-Image Pre-training (GLIP). GLIP is able to localize the objects a text prompt describes as bounding boxes, but it's not powerful enough by itself to do what InpaintNeRF360 needs. This is where the second step comes in: utilizing depth information. Bounding boxes tend to be inaccurate when you have unconstrained input cameras, so feeding in the depth information is critical.
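The paper's depth-space warping has its own details, but the core geometric idea of using per-pixel depth to carry a 2D location from one unconstrained camera into another can be sketched with a standard pinhole model. Everything below (the function name, intrinsics, and the world-to-camera convention) is my own illustrative assumption, not code from InpaintNeRF360.

```python
import numpy as np

def reproject_point(uv, depth, K, R_src, t_src, R_dst, t_dst):
    """Lift a pixel with known depth into 3D, then project it into another view.

    Assumes a pinhole camera with world-to-camera extrinsics x_cam = R @ x_world + t.
    `uv` is (u, v) in pixels and `depth` is the depth at that pixel in the
    source camera. Conventions here are illustrative, not taken from the paper.
    """
    u, v = uv
    # Back-project the pixel into the source camera frame.
    x_cam_src = depth * (np.linalg.inv(K) @ np.array([u, v, 1.0]))
    # Source camera frame -> world frame.
    x_world = R_src.T @ (x_cam_src - t_src)
    # World frame -> destination camera frame.
    x_cam_dst = R_dst @ x_world + t_dst
    # Perspective projection back to pixel coordinates.
    uvw = K @ x_cam_dst
    return uvw[:2] / uvw[2]

# Example: carry a prompt point from view A into view B using its depth.
K = np.array([[800.0, 0.0, 512.0],
              [0.0, 800.0, 384.0],
              [0.0, 0.0, 1.0]])
identity, zero = np.eye(3), np.zeros(3)
t_b = np.array([0.5, 0.0, 0.0])  # view B is translated relative to view A
print(reproject_point((512, 384), depth=3.0, K=K,
                      R_src=identity, t_src=zero, R_dst=identity, t_dst=t_b))
```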
Without the depth in the first two images, the view angle can produce information that is not accurate. However, when the depth information, represented by the red dots in the third image, is included, we see the refined end result. After these first two steps have been refined, it's finally time to use the SAM model to accurately segment the object in each view of the NeRF. You might be wondering: while it clearly is able to separate the object, what happens to the background information? To answer that, they apply dilation to each segmentation mask so that contextual background information is included! Finally, once the masks are obtained, the inpainting begins.
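Continuing the SAM sketch from earlier (and reusing its `image` and `best_mask` variables), here is a hedged illustration of those last two steps: dilating the mask so the inpainter also sees a ring of surrounding background context, then handing the masked view to a 2D inpainter. InpaintNeRF360 uses its own pre-trained inpainter; the Stable Diffusion inpainting pipeline below is only a stand-in, and the kernel size and text prompt are arbitrary choices of mine.

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

# Dilate the per-view SAM mask so contextual background pixels around the
# object are included in the region the inpainter reasons about.
mask = best_mask.astype(np.uint8) * 255            # boolean HxW -> 0/255
kernel = np.ones((15, 15), np.uint8)               # kernel size is arbitrary
dilated = cv2.dilate(mask, kernel, iterations=1)

# Inpaint the masked region of this view with an off-the-shelf 2D model.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")
result = pipe(
    prompt="empty ground, natural background",
    image=Image.fromarray(image).resize((512, 512)),
    mask_image=Image.fromarray(dilated).resize((512, 512)),
).images[0]
result.save("view_000_inpainted.png")
```

The per-view results would then still need the paper's fine-tuning step to stay consistent across views; a single-image inpainter on its own gives no such guarantee.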
Again, I think this will be very powerful when combined with LERF. As time goes on, I think the applications that fit in with LERF will only become more powerful and expand potential use cases. Nerfstudio co-founder and original NeRF co-author Matt Tancik spoke a bit about utilizing natural language processing in conjunction with NeRFs at his talk at MIT last month. We see further applications and doors opened through InpaintNeRF360. Interestingly, InpaintNeRF360 is built on Nerfstudio, which hopefully means we'll get to see it hosted for general consumption soon.
InpaintNeRF360 is an amazing step forward for editable NeRFs, and I'm excited to give it a try as soon as it's available!