Two authors of the wildly popular 3D Gaussian Splatting paper are back and have released a new paper, but about NeRFs?
*Gasps in Twitter AI Bro*
As the paper itself notes, platforms such as Luma publish capture guidelines for what the authors refer to as hemispherical captures. These guidelines, while incredibly robust and successful, prioritize object capture. For many NeRFs, that's perfect. Where they leave room for improvement is in general spaces: places that let people walk around and explore as they see fit.
If there were levels of NeRF subject difficulty, I would rank indoor rooms toward the higher end of the list. Evidently the authors felt the same way, and they emphasize the crucial role of selecting optimal camera placements in their paper. It makes sense when you think about it: if the subject of a NeRF is an object, the capturer knows the most requested angles and can provide complete coverage of it to generate an accurate reconstruction. That coverage is paramount to getting NeRFs without artifacts.
Once you take that central focus away, how do you ensure that you've captured the angles someone might actually want to view? Even with great information and reconstruction ability, there's a greater risk of missing something. In a perfect world you'd have infinite cameras capturing everything, but that isn't realistic.
So with that in mind, the authors set out to answer the question:

> How do we efficiently choose the next camera that will allow the resulting ray sampling to be as close as possible to uniform in space and angle?
They're acutely aware that garbage in results in garbage out, or at least in extreme modifications to the underlying methods to salvage the dataset. It would be significantly easier to have a method that lets you retain fidelity straight out of camera.
Previous methods ask the NeRF model to predict its own uncertainty and use that to inform the next step. However, this is extremely computationally intensive, as it requires several MLP evaluations for each candidate camera. Those methods are successful at what they do, but they are not set up to handle scenes that aren't focused on a definitive subject. For this paper, the authors build on top of Instant NGP, because they believe it offers the best quality/performance trade-off.
From this, they have identified a few core components: a definition of observation frequency, the angular distribution, the reconstruction estimation, and a greedy optimization algorithm to help determine the ideal camera locations.
The first component is the definition of an efficient metric for observation frequency and angular uniformity that can be computed on the fly during NeRF capture. Observation frequency is just what it sounds like: they take a point in space, labeled p, and the function describes how often that point has been seen. Once that has been defined, they move to the next step, angular uniformity.
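To make the idea concrete, here's a rough sketch in Python of how an observation-frequency count could look. The camera model (a simple viewing cone defined by position, direction, and half field of view) and the function names are my own illustration for intuition, not the paper's implementation, which also has to deal with occlusion.

```python
import numpy as np

def observation_frequency(p, cameras):
    """Count how many captured cameras see the 3D point p.

    cameras: list of (position, view_direction, half_fov) tuples -- a
    deliberately simplified viewing-cone stand-in for a real visibility test.
    """
    count = 0
    for cam_pos, view_dir, half_fov in cameras:
        to_point = p - cam_pos
        dist = np.linalg.norm(to_point)
        if dist < 1e-6:
            continue  # point sits on the camera itself
        to_point = to_point / dist
        view_dir = view_dir / np.linalg.norm(view_dir)
        # "Seen" if the point falls inside the camera's viewing cone.
        if np.dot(to_point, view_dir) > np.cos(half_fov):
            count += 1
    return count

# One camera 3 m back, looking at the origin: the origin is observed once.
cams = [(np.array([0.0, 0.0, -3.0]), np.array([0.0, 0.0, 1.0]), np.radians(30))]
print(observation_frequency(np.array([0.0, 0.0, 0.0]), cams))  # -> 1
```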
Imagine that you are trying to create a sculpture based on a room, but you can only look through small windows in the room's walls. The more windows you have, and the more evenly spaced they are, the better your view and understanding of what the sculpture should look like. If you only have a few windows, all on one side, you're missing a lot of details. With that in mind, ask how many of these windows let you see a particular part of the sculpture (Observation Frequency). If every window shows you the sculpture's nose, you have a great view of the nose. If none show it, you're missing out.
Now think about where the windows are placed. If they're all bunched up on one wall, you'll miss out on viewing the sculpture from other angles. Ideally, you'd want windows spread evenly all around the room (Angular Distribution). Now that they have those two metrics, they need to quantify how successful a capture is. Picking a single spot to measure would defeat the purpose of a free-viewpoint NeRF, so they divide the scene into a grid of smaller three-dimensional cells, labeled B32.
For each small cell in B32, they've created a formula to score how well you can see the part of the sculpture inside that cell (a sketch of one possible version follows this list). The formula considers:
How many windows give you a view of it (Observation Frequency).
How varied the angles of those windows are (Angular Distribution).
The importance of observing less-seen parts more than parts you've seen a lot.
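Here's a toy version of what such a per-cell score could look like, combining the three ingredients above. The exact combination (an observation count with diminishing returns, an angular-spread term, and an optional importance weight) is my paraphrase of the idea, not the formula from the paper.

```python
import numpy as np

def cell_score(view_dirs, weight=1.0):
    """Score one grid cell from the unit view directions that observe it.

    view_dirs: (N, 3) array of unit vectors from cameras to the cell center.
    weight: optional importance weight for this cell.
    Higher is better: many observations, spread over many angles, with
    diminishing returns so under-observed cells stay the priority.
    """
    n = len(view_dirs)
    if n == 0:
        return 0.0  # never observed: nothing to reward yet
    # Angular-spread proxy: clumped directions average to a vector of length
    # near 1; directions spread around the sphere average out to near 0.
    spread = 1.0 - np.linalg.norm(np.mean(view_dirs, axis=0))
    # log1p gives diminishing returns on the raw observation count.
    return weight * np.log1p(n) * (0.5 + 0.5 * spread)

# Two opposite viewpoints beat two identical ones, even though both count as 2.
opposite = np.array([[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]])
clumped = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
print(cell_score(opposite) > cell_score(clumped))  # -> True
```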
It's like having a systematic way to judge how well you can draw each part of the sculpture using the views from your available windows. The better the scores, the better your final drawing will be. And if you know that certain regions matter more than others, say eye level versus ceiling height, you can weight those regions accordingly.
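Encoding that preference in the sketch above could be as simple as a per-cell weight grid. The grid resolution, the room height, and the eye-level band below are all made-up numbers for illustration.

```python
import numpy as np

# Hypothetical 32x32x32 importance grid: everything starts equally important,
# then cells around eye level (assuming a 3 m tall capture volume) get
# boosted so the scoring favors covering them first.
weights = np.ones((32, 32, 32))
z = np.linspace(0.0, 3.0, 32)        # cell-center heights in metres
eye_level = (z > 1.4) & (z < 1.9)    # rough eye-level band
weights[:, :, eye_level] *= 2.0      # double the importance of those cells
```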
Now that that's understood, they get to the fun part: where to place new cameras. Obviously a camera can't be located inside the statue, or anywhere else that's physically impossible. Instant NGP can provide those off-limits locations through its occupancy grid, which it normally uses to skip over empty space during rendering; that same empty space is exactly where a camera can go. As brilliant as that is, there's an immediate issue: you don't start out with any images or a trained occupancy grid. So what do you do?
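As a mental model, checking whether a candidate camera position is even allowed could look something like the snippet below. The boolean array here is a stand-in for Instant NGP's occupancy grid (which actually lives on the GPU with its own machinery); the function name and bounding-box handling are my own.

```python
import numpy as np

def valid_camera_position(pos, occupancy, bbox_min, bbox_max):
    """Reject candidate camera positions that sit inside occupied space.

    occupancy: boolean 3D grid, True where the scene is solid -- a toy
    stand-in for Instant NGP's occupancy grid.
    bbox_min, bbox_max: corners of the capture bounding box.
    """
    pos = np.asarray(pos, dtype=float)
    if np.any(pos < bbox_min) or np.any(pos > bbox_max):
        return False  # outside the capture volume entirely
    # Map the position to the grid cell it falls in and require free space.
    res = np.array(occupancy.shape)
    idx = ((pos - bbox_min) / (bbox_max - bbox_min) * (res - 1)).astype(int)
    return not occupancy[tuple(idx)]
```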
They start from an empty bounding box, which gets filled in as more footage is captured. They suggest a mere 20 photos to kick off the reconstruction and the coordinate system. Right now it takes a couple of seconds for each subsequent camera placement to be suggested, but there is room for further optimization.
Now they want to choose the best spots to place cameras for the best view. Every so often during the capture, they pick a new spot: they try out a lot of candidate positions (1,000, to be exact) and keep the one that gives the best view according to their formula. This process repeats until they've placed all the cameras they can or want to.
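Put together, the greedy step they describe (sample a pile of candidates, score each one, keep the winner) could be sketched like this. The `score_fn` and `is_valid` callables are placeholders for the paper's uniformity metric and for an occupancy check like the one above; the 1,000 candidates matches the number mentioned in the previous paragraph.

```python
import numpy as np

def next_camera(score_fn, is_valid, bbox_min, bbox_max, n_candidates=1000, rng=None):
    """One greedy next-best-view step: sample candidates, keep the best.

    score_fn : candidate position -> how much adding a camera there would
               improve coverage (stand-in for the paper's metric).
    is_valid : rejects positions inside geometry or outside the bounding box.
    """
    rng = rng or np.random.default_rng()
    best_pos, best_score = None, -np.inf
    for _ in range(n_candidates):
        pos = rng.uniform(bbox_min, bbox_max)   # random spot inside the box
        if not is_valid(pos):
            continue                            # inside geometry: skip it
        score = score_fn(pos)
        if score > best_score:
            best_pos, best_score = pos, score
    return best_pos
```

Repeat that step after each new capture and you get the camera-by-camera suggestions the paper describes.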
What's really cool to me is that they are able to compute this during the actual capture of the NeRF and make adjustments based on it. That makes me think about how versatile the method could be, say if you're Google walking into a weirdly shaped business for the first time to take a NeRF for Google Maps.
NeRFs can be computationally heavy, depending on the model you are running. That's why they have also prioritized a method that does not put a large amount of extra strain on the system. Additionally, their algorithm is compatible with publicly available NeRF codebases, so you can adapt it to whatever your computer can handle.
Between this, Google's recently released CamP, and Luma AI's Flythroughs app, it's clear that attention is shifting toward the capture of subject-free spaces.
This has been one of my favorite papers of the year thus far, and I'm very, very excited for the code to be released. I believe several businesses will be able to leverage their algorithm effectively to guide people or drones through spaces and get better captures.