Live NeRF Video Calls

Michael Rubloff

Oct 5, 2023

Radiance Field Video Call

Catching up with my sister has been an exercise in bridging distances. She recently moved to Copenhagen, trading the familiar landscapes of our shared childhood for the charming streets of the Danish capital. Our interactions now mostly consist of FaceTime calls, where screens serve as a window to each other's lives. It's a decent solution, but sometimes the two-dimensional frames make me yearn for a more immersive experience.

This yearning for connection found an unexpected echo during my recent visit to SIGGRAPH, my first ever conference of this kind. It was akin to stepping into a digital art museum, with a vast hall adorned with pioneering works from people of various niches. Drawn, as if by an unseen magnet, to a section in the back left corner of the room, I stumbled upon a spectacle that seemed like something straight out of science fiction.

At first I wasn't sure what it was, but as I watched from a distance, I saw something incredible: live NeRFs being created from just a single webcam. Not only that, but people were having video calls with them. The displays were made by Looking Glass and are currently available to the public.

I took a look at this paper, Real-time Radiance Fields, back in early May, but I truly did not think I would get to experience it for myself so soon. I should probably learn better, as this tends to happen repeatedly. Rereading my article, I realized it was more about general excitement than how it actually works, so let's take a look at their method.

I found this statement to be particularly funny:


"We note that inferring a canonicalized 3D representation (i.e., the inferred 3D representation is frontalized and aligned) from an arbitrary RGB image while simultaneously synthesizing precise subject-specific details from the input is a highly non-trivial task."

Real-Time Radiance Fields

I would have to agree, based solely on the words in that sentence. To tackle this challenge, they break it into two goals: creating a canonicalized 3D representation of the subject from an image, and rendering high-frequency, person-specific details. Canonicalization in this context refers to representing the subject in a ‘standard’ form, meaning that the generated 3D model is aligned and oriented in a consistent, predefined manner regardless of how the subject is posed or oriented in the original 2D image. These two goals make sense, given the overarching aim.
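To make the idea of canonicalization a little more concrete, here is a toy sketch of my own, not the paper's method. The paper's encoder learns this alignment from the image itself; the sketch below simply assumes we already know how far the head is turned (a hypothetical `estimated_yaw_rad` input) and undoes that rotation, so the same points land in the same frontal frame no matter how the subject was posed. The function names are made up for illustration.

```python
import math

# Toy illustration of canonicalization (an assumption-laden sketch, not the paper's code):
# undo a known head yaw so that points always end up in the same frontal frame.

def rotate_yaw(points, yaw_rad):
    """Rotate 3D points (x, y, z) around the vertical (y) axis."""
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    return [(c * x + s * z, y, -s * x + c * z) for x, y, z in points]

def canonicalize(points, estimated_yaw_rad):
    """Bring posed points back to the frontal frame by applying the inverse yaw."""
    return rotate_yaw(points, -estimated_yaw_rad)

# Two made-up points on a frontal face, then the same face turned by 30 degrees.
frontal = [(0.0, 0.0, 1.0), (0.3, 0.1, 0.9)]
posed = rotate_yaw(frontal, math.radians(30))
recovered = canonicalize(posed, math.radians(30))
```

Whatever yaw is applied to `frontal`, `recovered` comes back to the same coordinates, which is the property the word "canonicalized" is describing.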

It begins with a hybrid encoder that combines convolutional neural networks (CNNs) and Transformer models, leveraging the strengths of both architectures. They use DeepLabV3 because of its speed. It extracts low-resolution features from the RGB images it is shown, which are then mapped to the canonical representation. Once that is complete, the information is passed on to a Vision Transformer (ViT) and a CNN.

This was a very conscious choice: the ViT can quickly map these features into a form similar to a triplane representation, and it allows high-resolution feature maps to carry information through from the original input to the final representation.
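If the triplane idea is new to you, here is a minimal sketch of how a triplane is typically queried. A 3D point is projected onto three axis-aligned feature planes (XY, XZ, YZ), a feature vector is read from each, and the three are combined, here by summation. Everything in it is an illustrative assumption on my part: the tiny 4x4 planes, the two feature channels, the nearest-neighbor lookup (real implementations interpolate), and all the function names.

```python
# Minimal triplane lookup sketch (pure Python, nearest-neighbor sampling).
# Sizes, values, and names are illustrative assumptions, not the paper's code.

def make_plane(resolution, channels, fill):
    """Create a resolution x resolution grid of feature vectors."""
    return [[[fill] * channels for _ in range(resolution)] for _ in range(resolution)]

def to_index(coord, resolution):
    """Map a coordinate in [-1, 1] to the nearest grid index."""
    clamped = max(-1.0, min(1.0, coord))
    return min(resolution - 1, int((clamped + 1.0) / 2.0 * resolution))

def sample_plane(plane, u, v):
    """Read the feature vector at plane coordinates (u, v)."""
    resolution = len(plane)
    return plane[to_index(v, resolution)][to_index(u, resolution)]

def sample_triplane(planes, point):
    """Project a 3D point onto the XY, XZ, and YZ planes and sum the features."""
    x, y, z = point
    f_xy = sample_plane(planes["xy"], x, y)
    f_xz = sample_plane(planes["xz"], x, z)
    f_yz = sample_plane(planes["yz"], y, z)
    return [a + b + c for a, b, c in zip(f_xy, f_xz, f_yz)]

# Three constant planes so the contribution of each is easy to see.
planes = {
    "xy": make_plane(4, 2, 1.0),
    "xz": make_plane(4, 2, 10.0),
    "yz": make_plane(4, 2, 100.0),
}
features = sample_triplane(planes, (0.25, -0.5, 0.9))
```

The appeal of this layout is that three 2D grids are far cheaper to store and produce than a dense 3D volume, which is part of why it suits a real-time system.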

While this is robust, there are still additional challenges with smaller details, such as strands of hair or birthmarks. To recover this information, they use a second encoder. It differs from the first by focusing on high-resolution features and using only a single downsampling stage, so that it captures more detailed information from the image. The final step is similar again, with the new information being passed forward into another ViT.

Now they move onto the training stage. This is where a GAN called EG3D comes into play. We took a look at GANs as part of GANeRF, if you want to learn more about how they work.

EG3D serves as a crucial component in training the described encoder-based method, as it provides synthetic data that acts as a basis for supervising the new method. Its attributes and efficient design make it a reliable source for generating synthetic data to train the encoder, ensuring the quality and efficiency of the learned representations.

A latent vector is sampled and passed through an EG3D generator to yield a corresponding triplane, 𝑻, and images are rendered from various camera parameters, 𝑷. These parameters include focal length, principal point, camera orientation and position. For each step in the training process where the model is updated, two images of the same identity are synthesized.
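For anyone unfamiliar with what focal length and principal point actually do, here is a small sketch of the standard pinhole camera projection. It covers only those two parameters; camera orientation and position would first move a point from world space into camera space, which this sketch assumes has already happened. The numbers (a focal length of 500 pixels, a principal point at the center of a 512x512 image) are my own illustrative choices, not values from the paper.

```python
# Pinhole projection sketch: how focal length and principal point map a
# camera-space 3D point to pixel coordinates.
# All values are illustrative assumptions, not the paper's settings.

def project(point_cam, focal, principal_point):
    """Project a camera-space point (x, y, z) with z > 0 to pixel coordinates."""
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point must be in front of the camera (z > 0)")
    cx, cy = principal_point
    return (focal * x / z + cx, focal * y / z + cy)

# A point straight ahead lands exactly on the principal point.
center = project((0.0, 0.0, 2.0), focal=500.0, principal_point=(256.0, 256.0))

# A point slightly right of and above the optical axis is shifted accordingly.
offset = project((0.2, -0.1, 2.0), focal=500.0, principal_point=(256.0, 256.0))
```

A longer focal length pushes `offset` farther from the center (a tighter, more zoomed-in view), while moving the principal point shifts the whole image.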

EG3D is a sophisticated pretrained 3D GAN, proficient in rendering 3D-aware images, utilizing hybrid triplane representation and neural volumetric rendering, with end-to-end training and superior efficiency. Its role is pivotal in the training of the encoder, involving an adversarial process where the encoder’s representations are evaluated against the original synthetic images, focusing on various aspects like color accuracy, perceptual likeness, and fine details. This ensures the encoder accurately and efficiently produces detailed and canonicalized 3D representations from 2D images. The high-quality and efficient renderings of EG3D make it an ideal base for supervising the training of the new encoder method.

This alone works great with just synthetic data, but that wouldn't be very useful to someone on a video call. Because EG3D is pre-trained, it assumes fixed values for camera roll, focal length, principal point, and distance from the subject when rendering images. To make it translate to the real world, these camera parameters are instead sampled from random distributions, introducing variability and diversity into the training data. This exposes the model to a wider range of perspectives and forces it to learn from highly variable and challenging images, making it more robust to the different viewpoints and details found in real-world images.
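Here is a rough sketch of what that randomization could look like. I want to be clear that the ranges, the use of uniform distributions, and the parameter names below are all hypothetical placeholders of mine; the article does not state the actual distributions. The sketch also draws two cameras for one identity, mirroring the two images synthesized per training step.

```python
import random

# Hypothetical sampling ranges (assumptions for illustration only).
RANGES = {
    "roll_deg": (-10.0, 10.0),     # camera roll around the viewing axis
    "focal_scale": (0.8, 1.2),     # multiplier on a nominal focal length
    "principal_x": (-0.05, 0.05),  # principal point offset, normalized units
    "principal_y": (-0.05, 0.05),
    "distance": (2.2, 3.2),        # camera distance from the subject
}

def sample_camera(rng):
    """Draw one set of camera parameters, each uniformly from its range."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()}

def sample_training_pair(rng):
    """One identity (a shared latent seed) paired with two random cameras."""
    latent_seed = rng.randrange(2**32)
    return latent_seed, sample_camera(rng), sample_camera(rng)

rng = random.Random(0)  # seeded so the example is reproducible
latent_seed, cam_a, cam_b = sample_training_pair(rng)
```

In a real pipeline, `latent_seed` would stand in for the latent vector fed to the generator, and each camera dictionary would be turned into the parameters used to render one of the two images.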

With all of these steps, it's hard to imagine how this can all be accomplished and still run in real time to support video conferencing. But shockingly, it takes 22ms on an A100 and 40ms on a 3090. The end result is a 24 fps transmission that photorealistically showcases a live person.
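Those numbers line up nicely if you do the arithmetic. The quick sketch below converts per-frame latency into frames per second and compares it with the time budget of a 24 fps stream. It assumes the quoted latencies cover the full per-frame cost, which is my assumption rather than something stated above.

```python
# Convert per-frame latency to frames per second, and a target frame rate
# to its per-frame time budget. Assumes latency covers the full frame cost.

def fps_from_latency(latency_ms):
    """Frames per second achievable at a given per-frame latency."""
    return 1000.0 / latency_ms

def frame_budget_ms(target_fps):
    """Milliseconds available per frame at a target frame rate."""
    return 1000.0 / target_fps

a100_fps = fps_from_latency(22.0)     # roughly 45 fps
rtx3090_fps = fps_from_latency(40.0)  # 25 fps
budget_24 = frame_budget_ms(24.0)     # roughly 41.7 ms per frame

# 40 ms per frame fits just inside the 24 fps budget.
fits_on_3090 = 40.0 <= budget_24
```

So the 3090 has only about a millisecond and a half of headroom per frame at 24 fps, while the A100 has nearly half the budget to spare.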

Original on left. Output on right.

Watching the live demonstrations, my thoughts meandered back to my sister. The screens before me were not just about technological marvels; they held the promise of a future where I could feel closer to her. Where the faces on our screens could break the boundaries of two dimensions, giving a sense of presence that our current video calls couldn't. Profile views might still be a challenge for the technology, but I couldn't help but think of how enriching it would be to see my sister's expressions in real-time 3D.



The strides being made in this field are astounding. A normal webcam paired with a consumer-level GPU has the potential to redefine our virtual interactions, making the world feel a bit smaller. Though my sister and I are miles apart, advancements like these give me hope that, in our virtual conversations, those distances could soon feel trivial.

