Research

CAT3D Pounces on 3D Scene Generation

Michael Rubloff

May 17, 2024


Only recently, we were looking at RealmDreamer, which generates scenes from prompts. Just over a month later, CAT3D, short for "Create Anything in 3D," has emerged and takes things up a notch or two.

CAT3D leverages multi-view diffusion models to generate highly consistent novel views from any number of input images, all the way down to just one. These views are then processed with robust 3D reconstruction techniques to produce detailed 3D representations that can be rendered in real time. Remarkably, CAT3D can create entire 3D scenes this way.

For all its impressiveness, there are really just two main steps to CAT3D's approach: generating novel views and 3D reconstruction.

The model begins by taking conditional views as input, with each view comprising an image and its corresponding camera pose. Each input image is then encoded into a latent representation using an image variational auto-encoder. This transformation reduces the high-dimensional image data into a more manageable lower-dimensional latent space, facilitating easier processing by the model.
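
As a rough illustration of that encoding step, here's what it looks like with Stable Diffusion's publicly available VAE standing in for CAT3D's own (unreleased) encoder; the model name, image size, and shapes are assumptions for the sketch, not details from the paper:

```python
import torch
from diffusers import AutoencoderKL

# Stable Diffusion's public VAE, standing in for CAT3D's image encoder.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

# A batch of conditioning views: (N, 3, H, W), pixel values scaled to [-1, 1].
images = torch.rand(3, 3, 512, 512) * 2 - 1

with torch.no_grad():
    # Each 512x512x3 image becomes a 4-channel 64x64 latent: an 8x
    # spatial downsample that is much cheaper to run diffusion on.
    latents = vae.encode(images).latent_dist.sample()

print(latents.shape)  # torch.Size([3, 4, 64, 64])
```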

The diffusion model captures the joint distribution of target images based on their camera parameters. It predicts the latent representations of the target images from the input images and their camera poses. To ensure consistency among the generated views, the model employs 3D self-attention layers that connect the latents of multiple input images.
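
Here's a minimal sketch of the 3D self-attention idea, assuming latents shaped (batch, views, tokens, dim); this is a guess at the mechanism using a stock PyTorch layer, not CAT3D's actual architecture:

```python
import torch
import torch.nn as nn

class MultiViewSelfAttention(nn.Module):
    """Self-attention across the tokens of all views at once, so each
    view's latents can attend to every other view's (a guess at the
    mechanism, not CAT3D's actual layer)."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, views, tokens, dim). Flatten views and tokens into one
        # sequence so attention spans all views jointly.
        b, v, t, d = x.shape
        x = x.reshape(b, v * t, d)
        out, _ = self.attn(x, x, x)
        return out.reshape(b, v, t, d)

# Example: 8 views, each a 32x32 latent flattened to 1024 tokens of width 64.
layer = MultiViewSelfAttention(dim=64)
out = layer(torch.rand(1, 8, 1024, 64))
```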

Camera poses are encoded using a raymap, which records the ray origin and direction at each spatial location. This representation is invariant to rigid transformations, ensuring that the generated views maintain accurate spatial relationships.
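
A minimal sketch of how such a raymap can be built from a camera-to-world matrix and pinhole intrinsics, assuming a z-forward (OpenCV-style) camera; the conventions and names are illustrative, not taken from CAT3D's code:

```python
import numpy as np

def raymap(c2w: np.ndarray, fx: float, fy: float, cx: float, cy: float,
           h: int, w: int) -> np.ndarray:
    """Return an (h, w, 6) raymap: ray origin and unit direction per pixel."""
    # Pixel centers, unprojected through the pinhole intrinsics.
    u, v = np.meshgrid(np.arange(w) + 0.5, np.arange(h) + 0.5)
    dirs_cam = np.stack([(u - cx) / fx, (v - cy) / fy, np.ones_like(u)], axis=-1)
    # Rotate directions into world space and normalize.
    dirs = dirs_cam @ c2w[:3, :3].T
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    # Every ray starts at the camera center.
    origins = np.broadcast_to(c2w[:3, 3], dirs.shape)
    return np.concatenate([origins, dirs], axis=-1)
```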

Once the multi-view diffusion model has been trained, it can generate a large set of synthetic views to cover the entire scene. This includes designing camera trajectories that ensure thorough and dense coverage of the scene. These trajectories must avoid passing through objects and maintain reasonable viewing angles. Four types of paths are explored: orbital, forward-facing circle, spline, and spiral trajectories.
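
As an example of the first of those, a simple orbital path might look like this; the sketch assumes an OpenGL-style camera convention and a scene centered at the origin, and is not CAT3D's actual trajectory code:

```python
import numpy as np

def orbital_trajectory(n_views: int, radius: float, height: float) -> list[np.ndarray]:
    """Camera-to-world poses on a circle of the given radius and height,
    each looking at the origin (OpenGL-style: camera looks down -z)."""
    world_up = np.array([0.0, 0.0, 1.0])
    poses = []
    for theta in np.linspace(0.0, 2.0 * np.pi, n_views, endpoint=False):
        eye = np.array([radius * np.cos(theta), radius * np.sin(theta), height])
        forward = -eye / np.linalg.norm(eye)                 # toward the origin
        right = np.cross(forward, world_up)
        right /= np.linalg.norm(right)
        up = np.cross(right, forward)
        c2w = np.eye(4)
        c2w[:3, :3] = np.stack([right, up, -forward], axis=-1)  # columns: x, y, z
        c2w[:3, 3] = eye
        poses.append(c2w)
    return poses
```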

The target viewpoints are clustered into smaller groups based on their proximity. The model generates each group independently, ensuring local consistency within each group and long-range consistency between groups.
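
One simple way to form those groups is to cluster the camera positions; the k-means below is a stand-in for whatever grouping the authors actually use:

```python
import numpy as np
from sklearn.cluster import KMeans

def group_viewpoints(positions: np.ndarray, group_size: int) -> list[np.ndarray]:
    """Split (N, 3) camera positions into spatially nearby groups."""
    k = max(1, len(positions) // group_size)
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(positions)
    # Return the indices of the target views belonging to each group.
    return [np.flatnonzero(labels == i) for i in range(k)]
```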

For single-image conditioning, an autoregressive strategy is used. Initially, a set of anchor views is generated to cover the scene. Subsequent views are then generated in parallel, using the observed and anchor views as conditioning inputs.
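
In rough Python, the strategy might look like this, where `sample` is a hypothetical stand-in for one call to the multi-view diffusion model:

```python
def generate_scene(sample, input_image, input_pose, anchor_poses, pose_groups):
    """`sample(images, poses, target_poses)` is a hypothetical stand-in
    for one call to the multi-view diffusion model."""
    # Step 1: jointly sample a handful of anchor views from the single
    # input, fixing the scene's global content.
    anchors = sample([input_image], [input_pose], anchor_poses)

    # Step 2: generate the remaining groups in parallel, each conditioned on
    # the observed view plus the anchors for long-range consistency.
    cond_images = [input_image] + list(anchors)
    cond_poses = [input_pose] + list(anchor_poses)
    return [sample(cond_images, cond_poses, group) for group in pose_groups]
```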

With a single input image, CAT3D generates 80 views to cover the scene; with more input images, that count rises to roughly 460-960.

Unsurprisingly, they build upon Google's in-house NeRF method, Zip-NeRF, with some additional modifications. They include a perceptual loss (LPIPS) between the rendered image and the input image, which helps preserve textures and fine details while tolerating low-level inconsistencies between generated views. The losses for generated views are also weighted by their distance to the nearest observed view, so that views closer to the input images have a greater influence on the reconstruction.
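
A hedged sketch of what that per-view loss could look like, using the public lpips package; the exponential distance weighting here is my own guess at a weighting function, not the paper's formula:

```python
import torch
import lpips

# Pretrained perceptual metric; expects NCHW images in [-1, 1].
percep = lpips.LPIPS(net="vgg")

def view_loss(rendered, target, dist_to_nearest_observed, scale=1.0):
    """Per-view loss, down-weighted for generated views far from any real
    input view. The exponential weighting is an illustrative guess."""
    w = torch.exp(torch.as_tensor(-dist_to_nearest_observed / scale))
    photometric = torch.abs(rendered - target).mean()
    perceptual = percep(rendered, target).mean()
    return w * (photometric + perceptual)
```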

Impressively, CAT3D can create full scenes in roughly a minute. The ceiling on this technology is still unknown; only a year ago, cutting generation time from many hours down to a single hour was impressive, and that was for individual objects.

CAT3D also benchmarks against ReconFusion, an impressive paper from the Google team late last year that explored diffusion priors. CAT3D exceeded ReconFusion's fidelity in every experiment run.

Training CAT3D does require 16 A100 GPUs, but I don't believe the takeaway is that you need a large-scale workstation to run it. I continue to see results like these as very positive indicators that these methods work and can be optimized quickly. With Google also recently announcing Veo, I have to wonder how the two might play into one another.

For more information, visit the CAT3D project page. They also have interactive demos of outputs converted to Gaussian Splatting for people to try out!
