It had been too quiet recently for Jon Barron and Ben Mildenhall.
The team from Google Research has made significant progress in accelerating Neural Radiance Field (NeRF) training with Zip-NeRF, which combines two previously incompatible techniques, Mip-NeRF 360 and Instant NGP (iNGP). The combined model reduces error rates by 8% to 76% compared to prior techniques and trains 22 times faster than Mip-NeRF 360.
One of Google's previous papers, Mip-NeRF 360, had long been a top standard within NeRF. However, it is so computationally taxing that it is not usable for consumers, and its training times are extremely long.
When I first viewed the demo video, I kept thinking, this is going to take forever to train. From all the posts I've seen this morning, people cannot believe that this technology exists, but what I would like to focus on is how long this took to generate: ~55 minutes. The quality and the area covered in under an hour are, quite frankly, nothing short of amazing. It's also important to note how well Zip-NeRF handles exposure. When you examine the video from approximately 1:00 to 1:40, Zip-NeRF handles highlights significantly better than both Mip-NeRF 360 and Instant-NGP.
The photorealistic hallmark of NeRFs is on full display here and my phone has been buzzing with people talking about how Zillow, Apartments.com, and other platforms can utilize NeRF. Yes, it's all coming and you can believe that NeRFs will be at the front of it.
With Google recently rolling out Immersive View, it becomes apparent how strong their foundation is for showcasing the atmosphere of a place of business. NeRFs appear to be headed directly toward transforming the way people experience a space. Additionally, with Zip-NeRF, atmosphere and mise en scène will become even more important.
NeRF involves training a neural network to model a volumetric representation of a 3D scene, allowing for novel views of the scene to be rendered via ray-tracing. The original NeRF model utilized a multilayer perceptron (MLP) to parameterize the mapping from spatial coordinates to colors and densities. While this approach is expressive, MLPs are slow to train, which has led researchers to accelerate training by replacing or augmenting MLPs with voxel-grid-like data structures.
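To make the "colors and densities" part concrete, here is a minimal sketch of the standard volume-rendering quadrature that NeRF-style models use to turn samples along one ray into a pixel. It is not taken from the Zip-NeRF code; the function name `composite_ray` and its list-based inputs are illustrative assumptions, and the network that would produce the densities and colors is left out.

```python
import math

def composite_ray(densities, colors, deltas):
    """Alpha-composite samples along one ray, front to back.

    densities: volume density (sigma) at each sample
    colors:    [r, g, b] at each sample
    deltas:    length of the ray segment each sample covers
    (All three are assumed to come from some model; none is trained here.)
    """
    transmittance = 1.0            # fraction of light that has not been absorbed yet
    pixel = [0.0, 0.0, 0.0]
    weights = []
    for sigma, rgb, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)   # opacity of this segment
        weight = transmittance * alpha           # contribution of this sample
        weights.append(weight)
        for c in range(3):
            pixel[c] += weight * rgb[c]
        transmittance *= 1.0 - alpha             # light left for samples behind
    return pixel, weights
```

An empty segment followed by a very dense one renders as the color of the dense segment, since the first sample gets zero weight and the second absorbs all remaining light.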
Instant NGP (iNGP) is one such example. It uses a pyramid of coarse and fine grids to construct learned features processed by a tiny MLP, which greatly accelerates training. The original NeRF model also suffered from aliasing, as it reasons about individual points along a ray, resulting in jaggies in rendered images and limited ability to reason about scale. Mip-NeRF addressed this issue by casting cones instead of rays, and by featurizing the entire volume within a conical frustum for use as input to the MLP. Mip-NeRF and its successor, mip-NeRF 360, showed that this approach enables highly accurate rendering on challenging real-world scenes.
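As a rough illustration of the "pyramid of grids" idea, the sketch below looks up a single 3D point in several grids of different resolution, trilinearly interpolates each one, and concatenates the results into one feature vector. It assumes small dense grids stored as nested lists, which is a simplification chosen for readability (iNGP itself uses hash tables for its finer levels). The names `trilerp`, `pyramid_features`, and `make_grid` are hypothetical helpers, not part of any library.

```python
def trilerp(grid, x, y, z):
    """Trilinearly interpolate grid[i][j][k] (a feature list) at a point in [0, 1]^3."""
    n = len(grid) - 1                                   # cells per axis
    pos = [min(max(v, 0.0), 1.0) * n for v in (x, y, z)]
    base = [min(int(p), n - 1) for p in pos]            # lower corner of the cell
    frac = [p - b for p, b in zip(pos, base)]           # position inside the cell
    dim = len(grid[0][0][0])
    out = [0.0] * dim
    for di in (0, 1):
        for dj in (0, 1):
            for dk in (0, 1):
                w = ((frac[0] if di else 1.0 - frac[0]) *
                     (frac[1] if dj else 1.0 - frac[1]) *
                     (frac[2] if dk else 1.0 - frac[2]))
                corner = grid[base[0] + di][base[1] + dj][base[2] + dk]
                for c in range(dim):
                    out[c] += w * corner[c]
    return out

def pyramid_features(grids, x, y, z):
    """Concatenate interpolated features from every grid, coarse to fine."""
    feats = []
    for grid in grids:
        feats.extend(trilerp(grid, x, y, z))
    return feats

def make_grid(res, fn):
    """Hypothetical helper: fill a grid of (res + 1)^3 vertices with fn(x, y, z)."""
    return [[[fn(i / res, j / res, k / res) for k in range(res + 1)]
             for j in range(res + 1)] for i in range(res + 1)]
```

In a real model the grid values would be learned parameters and the concatenated vector would be fed to the tiny MLP. Note that the lookup happens at a single coordinate with no notion of scale, which is where the aliasing trouble discussed next comes from.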
Unfortunately, the advances in fast training and in anti-aliasing are not easily compatible. Mip-NeRF's anti-aliasing strategy depends critically on the use of positional encoding to featurize a conical frustum into a discrete feature vector. In contrast, current grid-based approaches like iNGP do not use positional encoding and instead use learned features obtained by interpolating into a hierarchy of grids at a single 3D coordinate. This makes it difficult to adapt anti-aliasing approaches from rendering to grid-based NeRF models like iNGP.
To address this issue, the researchers leveraged ideas from multisampling, statistics, and signal processing to integrate iNGP's pyramid of grids into mip-NeRF 360's framework. They called their model "Zip-NeRF" due to its speed, its similarity with mip-NeRF, and its ability to fix zipper-like aliasing artifacts. On the mip-NeRF 360 benchmark, Zip-NeRF reduces error rates by as much as 18% and trains 22 times faster than the previous state-of-the-art. Compared to Instant-NGP, the results are even more staggering: 26%, 41%, and 36% reductions in RMSE, DSSIM, and LPIPS, respectively.
On a multiscale variant of that benchmark, which more thoroughly measures aliasing and scale, Zip-NeRF reduces error rates by as much as 76%.
Mip-NeRF 360 and iNGP differ significantly in how coordinates along a ray are parameterized. Mip-NeRF 360 subdivides a ray into a set of intervals, each representing a conical frustum whose shape is approximated with a multivariate Gaussian. The expected positional encoding with respect to that Gaussian is used as input to a large MLP. In contrast, iNGP trilinearly interpolates into a hierarchy of differently-sized 3D grids to produce feature vectors for a small MLP. Combining these two approaches introduces two forms of aliasing that the researchers had to address.
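The "expected positional encoding" has a simple closed form that is worth seeing once. For a Gaussian with mean mu and variance var, the expected value of sin(s * x) is sin(s * mu) * exp(-0.5 * s^2 * var), and likewise for cos. The sketch below applies this to a single coordinate. The function name is illustrative, and treating one axis on its own amounts to assuming a diagonal covariance, so take it as a simplified picture rather than mip-NeRF 360's exact featurization.

```python
import math

def expected_positional_encoding(mean, var, num_freqs):
    """Expected sin/cos features of x ~ N(mean, var) at frequencies 1, 2, 4, ...

    Returns [E sin(x), E cos(x), E sin(2x), E cos(2x), ...].
    """
    feats = []
    for level in range(num_freqs):
        scale = 2.0 ** level
        # Wide Gaussians (large frustums) damp high frequencies toward zero.
        damp = math.exp(-0.5 * (scale ** 2) * var)
        feats.append(damp * math.sin(scale * mean))
        feats.append(damp * math.cos(scale * mean))
    return feats
```

With zero variance this reduces to the ordinary positional encoding of a point. As the variance grows, the high-frequency entries shrink toward zero, so the MLP only sees detail that the frustum is small enough to resolve. A learned grid lookup at a single coordinate has no such built-in damping.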
First, Instant NGP's feature grid approach is incompatible with mip-NeRF 360's frustum-based approach. To solve this issue, the researchers adopted a multisampling strategy, wherein they sample multiple points within each conical frustum. They then average the feature vectors obtained from the grid hierarchy for each of these points, resulting in a multiscale feature representation for the entire frustum.
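The averaging step can be sketched in a few lines. This is only an illustration of the general multisampling idea, not the paper's method: the sampler below simply spaces points evenly along the ray inside one interval, and the paper's actual sample pattern and any per-sample weighting are not reproduced. `feature_fn` stands in for whatever grid lookup the model uses, and both function names are hypothetical.

```python
def frustum_points(origin, direction, t0, t1, n):
    """Hypothetical sampler: n evenly spaced points on the ray between t0 and t1."""
    points = []
    for i in range(n):
        t = t0 + (t1 - t0) * (i + 0.5) / n      # midpoint of the i-th sub-interval
        points.append(tuple(o + t * d for o, d in zip(origin, direction)))
    return points

def multisample_features(feature_fn, points):
    """Average feature_fn(x, y, z) over all sample points of one frustum."""
    if not points:
        raise ValueError("need at least one sample point")
    total = None
    for x, y, z in points:
        feats = feature_fn(x, y, z)
        if total is None:
            total = list(feats)
        else:
            total = [a + b for a, b in zip(total, feats)]
    return [value / len(points) for value in total]
```

Averaging over several lookups acts as a crude low-pass filter: detail finer than the spread of the samples tends to cancel out, while coarse structure survives.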
The second challenge was that iNGP's trilinear interpolation does not produce feature vectors equivalent to mip-NeRF 360's positional encoding. To address this, the researchers introduced a differentiable estimator for the expected value of a trilinearly interpolated feature vector. This estimator combines the grid-based features and the expected positional encoding within each frustum. It also allows them to use a small MLP, like in iNGP, while still benefiting from mip-NeRF 360's conical frustum-based approach to anti-aliasing.
By combining these two solutions, the researchers created a novel method that integrates the best of both mip-NeRF 360 and iNGP, resulting in significantly improved training times and reduced aliasing issues. This new model, Zip-NeRF, is capable of producing high-quality, photorealistic renderings while minimizing artifacts such as jaggies and missing scene content.
The success of Zip-NeRF opens up new possibilities for the application of NeRF in various fields. Its fast training times could enable more rapid development and deployment of generative media, virtual reality environments, and robotic perception systems. Additionally, the reduction in aliasing artifacts improves the overall quality and realism of rendered scenes, potentially benefiting applications in filmmaking, video game design, and computational photography.
Moreover, the researchers' approach to combining mip-NeRF 360 and iNGP could inspire further research into the integration of disparate techniques in the field of neural rendering. The innovative solutions they employed to overcome the challenges posed by integrating these two methods demonstrate the value of exploring hybrid models and drawing from multiple domains to achieve optimal performance.
This breakthrough brings together the benefits of Mip-NeRF 360's anti-aliasing and iNGP's accelerated training. The resulting model, Zip-NeRF, offers a faster and more accurate approach to NeRF training, with applications in various fields such as robotics, computational photography, and generative media.