NeRSemble Revolutionizes Human Head Rendering in Virtual Environments
Michael Rubloff
May 12, 2023
We recently saw NVIDIA's Live 3D Portrait, which took a large step forward in rendering novel views from a single reference. Now we have NeRSemble, otherwise known as Dynamic Neural Radiance Fields using Hash Ensembles, which continues the progress toward immersive faces built from NeRFs. This is particularly exciting for me, as I view memory capture and preservation as one of NeRF's largest use cases. It represents a massive step toward photorealistic outputs from dynamic scenes of people, something that appears ridiculously difficult.
In the shorter term, this will mainly benefit game studios and VFX companies looking to photorealistically reconstruct an actor's face. Historically, the challenge with human faces has been the sheer amount of minute detail, such as strands of hair, wrinkles, and view-dependent reflections.
Researchers have recently made a significant breakthrough in rendering highly realistic human heads in virtual environments. The team has developed NeRSemble, a cutting-edge method that synthesizes novel views of human heads in complex motion with remarkable accuracy.
To keep the data consistent across subjects, they use the same capture script for each participant:
"Specifically, our capture script consists of 9 expression sequences covering different facial muscle groups, 1 hair sequence with fast movements, 4 emotion sequences, 10 sentences with audio, and 1 longer sequence where subjects are free to perform arbitrary facial deformations and head motions."
The research team from the Technical University of Munich set up 16 time-synchronized cameras shooting at a high frame rate to capture microexpressions from millisecond to millisecond. The shutter speed is roughly 1/340 of a second, but the focal length used is not disclosed. The result is a dataset of over 31.7 million images of more than 220 human heads, providing a vast and diverse range of facial dynamics, including head motions, natural expressions, emotions, and spoken language.
NeRSemble is capable of reconstructing high-fidelity radiance fields of human heads, capturing their animations over time, and synthesizing re-renderings from novel viewpoints at arbitrary time steps. This remarkable achievement is made possible by combining a deformation field and an ensemble of 3D multi-resolution hash encodings.
Just like Instant-NGP, they use a multi-resolution hash grid to take advantage of its small memory footprint. Interestingly, they use continuous-valued alpha maps to discourage NeRSemble from modeling the background of the captures. I am curious to see how this technique will be applied in future and existing methods. It appears to be a massive boon for those trying to create product NeRFs.
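To get a feel for why a hash grid keeps memory small, here is a rough Python sketch. It is not NeRSemble's code. The level count, table size, feature width, and resolution range are assumptions picked from the ranges Instant-NGP reports, and the function names are my own. It compares the parameter count of a hashed multi-resolution grid against a dense grid at the same resolutions, and includes the spatial hash from the Instant-NGP paper for reference.

```python
import math

# Spatial hash primes given in the Instant-NGP paper (the first one is 1).
PRIMES = (1, 2654435761, 805459861)


def level_resolutions(n_min, n_max, levels):
    """Geometric progression of grid resolutions from n_min up to about n_max."""
    if levels == 1:
        return [n_min]
    growth = math.exp((math.log(n_max) - math.log(n_min)) / (levels - 1))
    return [math.floor(n_min * growth ** i) for i in range(levels)]


def hash_index(ix, iy, iz, table_size):
    """Map an integer grid vertex to a slot in a fixed-size feature table."""
    return ((ix * PRIMES[0]) ^ (iy * PRIMES[1]) ^ (iz * PRIMES[2])) % table_size


def param_counts(n_min=16, n_max=2048, levels=16, table_size=2 ** 19, features=2):
    """Return (hashed, dense) parameter counts summed over all levels.

    All default values are illustrative assumptions, not NeRSemble's settings.
    """
    hashed = 0
    dense = 0
    for res in level_resolutions(n_min, n_max, levels):
        vertices = (res + 1) ** 3
        dense += vertices * features
        # Coarse levels fit in the table one-to-one; fine levels are capped
        # at table_size entries and share slots through the hash.
        hashed += min(vertices, table_size) * features
    return hashed, dense


hashed, dense = param_counts()
print(f"hashed grid: {hashed:,} parameters")
print(f"dense grid:  {dense:,} parameters")
print(f"dense / hashed: {dense / hashed:.0f}x")
```

The cap on each level's table is where the savings come from: however fine the resolution gets, a level never stores more than `table_size` feature vectors, at the cost of occasional hash collisions.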
The NeRSemble method outperforms state-of-the-art dynamic radiance field approaches by a significant margin, setting a new benchmark for human head reconstruction. The researchers' dataset combines high-resolution, high-frame-rate recordings of numerous subjects, a combination unmatched by any other dataset in the field. This dataset will be publicly released and will include a new benchmark for dynamic novel view synthesis (NVS) of human heads, which will help advance the field and increase comparability across methods.
This breakthrough has wide-ranging implications for various industries, including computer games, movie productions, virtual reality (VR), and augmented reality (AR) applications. The technology could significantly improve digital applications that rely on photo-realistic rendering of images from captured scene representations, such as immersive video conferencing, VR-ready avatar rendering, and studying microexpressions, among others.
As NeRSemble continues to be refined and developed, it is expected to play a crucial role in the increasingly important digital applications that require high-fidelity human head reconstruction and dynamic NVS.
To reduce floaters in their outputs, NeRSemble uses three clever methods: using a very tight axis-aligned bounding box (AABB) to limit the amount of information processed, removing data that was seen by fewer than two cameras, and applying a low-pass filter to the density grid.
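The three floater-reduction steps can be illustrated on a toy grid. This is a minimal sketch under my own assumptions, not the authors' implementation: the grid is a plain dictionary of cells, the camera counts are made up, and a 3x3x3 mean filter stands in for the low-pass filter, since the exact filter is not stated here. All function names are hypothetical.

```python
from itertools import product


def inside_aabb(point, box_min, box_max):
    """True if a 3D point lies inside the axis-aligned bounding box."""
    return all(lo <= p <= hi for p, lo, hi in zip(point, box_min, box_max))


def mask_low_visibility(density, seen_by, min_cameras=2):
    """Zero the density of cells observed by fewer than min_cameras cameras."""
    return {
        cell: (value if seen_by.get(cell, 0) >= min_cameras else 0.0)
        for cell, value in density.items()
    }


def box_blur(density, size):
    """3x3x3 mean filter (a simple low-pass filter) over a size^3 grid.

    Cells outside the grid are treated as empty.
    """
    offsets = list(product((-1, 0, 1), repeat=3))
    blurred = {}
    for cell in product(range(size), repeat=3):
        total = sum(
            density.get((cell[0] + dx, cell[1] + dy, cell[2] + dz), 0.0)
            for dx, dy, dz in offsets
        )
        blurred[cell] = total / len(offsets)
    return blurred


# Step 1: samples outside a tight box around the head are simply skipped.
print(inside_aabb((0.1, 0.0, -0.2), (-0.5, -0.5, -0.5), (0.5, 0.5, 0.5)))
print(inside_aabb((2.0, 0.0, 0.0), (-0.5, -0.5, -0.5), (0.5, 0.5, 0.5)))

# Toy 4x4x4 density grid with two isolated spikes.
size = 4
density = {cell: 0.0 for cell in product(range(size), repeat=3)}
density[(1, 1, 1)] = 27.0   # observed by three cameras (made-up count)
density[(3, 3, 3)] = 5.0    # observed by a single camera (made-up count)
seen_by = {(1, 1, 1): 3, (3, 3, 3): 1}

# Step 2: drop what fewer than two cameras saw. Step 3: smooth the grid.
masked = mask_low_visibility(density, seen_by)
blurred = box_blur(masked, size)
print(blurred[(1, 1, 1)], blurred[(3, 3, 3)])
```

The blur spreads an isolated spike over its neighbours and lowers its peak, so a later occupancy threshold (also an assumption on my part) can discard small stray blobs while large, consistent regions such as the head itself survive.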
The end result is a photorealistic human head that can be rendered dynamically and showcases the minutiae of human expression and features. However, some limitations remain: NeRSemble struggles with rapid hair movement, and the same can be said for the interior of the mouth. Both of these are to be expected; my dentist will have to wait a little longer before taking a NeRF during my checkup.
The code has not been released yet, but the authors have said that both the code and the entire dataset will be published in the future for people to use. For those who want to download the dataset, you had better get Dropbox Pro or a big Synology, because it comes in at a whopping 203 TB!
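For a sense of scale, here is some back-of-the-envelope arithmetic using only the figures quoted in this article. The assumptions are mine: decimal terabytes, the image and subject counts taken as exact even though they are stated as lower bounds, and a hypothetical sustained 1 Gbit/s connection.

```python
TOTAL_BYTES = 203e12    # 203 TB, assuming decimal terabytes
TOTAL_IMAGES = 31.7e6   # "over 31.7 million images", treated as exact
SUBJECTS = 220          # "more than 220 human heads", treated as exact


def average_image_mb():
    """Average size per image in megabytes."""
    return TOTAL_BYTES / TOTAL_IMAGES / 1e6


def average_subject_gb():
    """Average data per subject in gigabytes."""
    return TOTAL_BYTES / SUBJECTS / 1e9


def download_days(link_gbit_per_s):
    """Days needed to download everything at a sustained link speed."""
    seconds = TOTAL_BYTES * 8 / (link_gbit_per_s * 1e9)
    return seconds / 86400


print(f"average per image:   {average_image_mb():.1f} MB")
print(f"average per subject: {average_subject_gb():.0f} GB")
print(f"download at 1 Gbit/s: {download_days(1.0):.1f} days")
```

Because the counts are lower bounds, the per-image and per-subject averages are upper estimates.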
The excitement continues to grow as substantial progress is made. As additional papers are released, I grow more confident that an "overnight success" for NeRF is only a matter of time. The progress has been undeniable, and what was previously considered challenging continues to fall.