
Michael Rubloff
Apr 4, 2025
At NVIDIA’s GTC conference, Sanja Fidler, VP of AI Research, shares how Radiance Fields, neural reconstruction, and large-scale simulation are reshaping how we capture, understand, and interact with the world.
This conversation was recorded live at NVIDIA's GTC conference.
Michael: I'm here with the VP of AI Research at NVIDIA, as well as the leader of NVIDIA's Toronto AI Lab—now rebranded as the Spatial Intelligence Lab. Some of the most incredible Radiance Field papers over the last few years have come out of this lab, and I’m really excited to be speaking with you today.
Sanja: I'm very excited to be here, Michael. One thing—we're actually rebranding ourselves as the Spatial Intelligence Lab because we've grown a lot in the past year. We now have a physics research team, perception, and more, so “Toronto” felt too narrow.
Michael: Yes, yes. I definitely want to talk about some of the spatial intelligence work coming out of your lab—it's incredible. Over the last few years, there's been this explosion of radiance field–related papers—like Instant-NGP all the way to 3D Gaussian Splatting. What's really been driving all this excitement? Why do you think people are so drawn to this world?
Sanja: Yeah, I mean—NeRFs came out in 2020, right? That really felt like a moment that transformed capture. Before, you'd take images or video and that was it—you could replay it, but that was all. Suddenly, you could turn that into an immersive experience. You could rewatch from a different trajectory, start interacting with the content. And it was cheap—anyone with a phone could record a scene and suddenly relive that moment.
Michael: Yes, exactly. Anything capable of taking a still image can now reconstruct the world into a lifelike scene. It feels like imaging is starting to evolve into a fundamentally new medium that just wasn’t possible before.
Sanja: That's right.
Michael: One of the really exciting leaps forward was the 3D Gaussian Splatting paper. One benefit was the real-time rendering, but because it's rasterized, you still get things like popping and other limitations. Your lab released a paper last year called 3D Gaussian Ray Tracing. I wanted to clarify—would you consider that part of the Gaussian Splatting family, or is it a different technology altogether?
Sanja: The representation is the same—you still have 3D particles living in space. They're Gaussians with covariance and view-dependent color. What changes is how rendering is formulated—going from rasterization to ray tracing. That’s the core shift. This allows you to do different things. Our motivation was dealing with real-world cameras. Rasterization is like an idealized camera, but we wanted to handle wide lenses, rolling shutters—stuff you find in real consumer cameras. For that, the formulation needs to change a bit.
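The shift Sanja describes, from projecting Gaussians onto the image plane to intersecting them with rays, has a nice closed form worth spelling out. A toy sketch (my own illustration, not NVIDIA's 3DGRT code): a 3D Gaussian's density restricted to a ray is itself a 1D Gaussian in the ray parameter, so the point of peak response along the ray can be computed directly.

```python
import numpy as np

def max_response_along_ray(o, d, mu, cov):
    """Peak response of a 3D Gaussian particle along the ray o + t*d.

    The density restricted to the ray is a 1D Gaussian in t, so its
    maximum has a closed form. Toy sketch, not production 3DGRT code.
    """
    A = np.linalg.inv(cov)                 # precision matrix of the Gaussian
    diff = mu - o
    t_max = (d @ A @ diff) / (d @ A @ d)   # argmax of the 1D Gaussian in t
    x = o + t_max * d                      # closest point in the Mahalanobis sense
    resp = np.exp(-0.5 * (x - mu) @ A @ (x - mu))
    return t_max, resp
```

Because each particle is evaluated per ray rather than splatted through a single idealized projection, rays can originate anywhere and point in any direction, which is what opens the door to distorted lenses and, eventually, secondary bounces.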
Michael: And the results are super impressive. I also spoke to a few of the authors, and they mentioned that they intentionally kept a lot of the vanilla Gaussian Splatting pipeline the same to allow for accurate benchmarking. But a lot of the research done on splats still translates to ray tracing, right? I'm really excited for the future of these ray-traced approaches, where you start to really leverage the capabilities of the ray tracing pipeline. On that same note, there's this desire for secondary lighting effects from ray tracing that isn't possible with 3D Gaussian Splatting, so your lab introduced another paper, 3D Gaussian Unscented Transform, which starts to bring some of those capabilities to rasterization. How exactly did you do that?
Sanja: Let me tell you about it, because 3DGRT is essentially utilizing ray tracing, right? It uses NVIDIA OptiX to build a BVH, a bounding volume hierarchy, and then you're essentially ray tracing against those primitives. That makes it a few times slower, both in training and inference; I think it's about two to three times slower to render. Plus, OptiX relies on NVIDIA RTX GPUs, so specialized GPUs, if you will. We wanted to be able to utilize all GPUs, which is the nice thing about Gaussian Splats, right? They can run on any GPU. That's why you get this crazy adoption. So we wanted to change the formulation a little bit to still utilize essentially all GPUs, and to be faster, while at the same time enabling all the secondary lighting effects. The idea of the Unscented Transform is really going back to rasterization. This formulation still uses rasterization, but it does it a little differently. It uses the unscented transform, which represents the Gaussian distribution with sigma points, a bunch of carefully selected points. You then push those points through whatever transformation you need. It can be any nonlinear transformation; a fisheye camera, for example, is a very nonlinear one. You just transform those points, and then you compute the statistics in image space.
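The sigma-point idea Sanja describes can be illustrated in a few lines. A hedged sketch (toy code with an assumed equidistant-fisheye projection for the nonlinear camera; not the 3DGUT implementation): pick sigma points of the 3D Gaussian, push each through the nonlinear camera, and recover the image-space mean and covariance from the transformed points.

```python
import numpy as np

def sigma_points(mu, cov, kappa=0.0):
    # Standard unscented-transform sigma points for an n-dim Gaussian:
    # the mean plus symmetric offsets along columns of the matrix square
    # root of (n + kappa) * cov, giving 2n + 1 weighted points.
    n = len(mu)
    L = np.linalg.cholesky((n + kappa) * cov)
    pts = [mu]
    for i in range(n):
        pts.append(mu + L[:, i])
        pts.append(mu - L[:, i])
    w0 = kappa / (n + kappa)
    weights = np.array([w0] + [1.0 / (2 * (n + kappa))] * (2 * n))
    return np.array(pts), weights

def project_fisheye(p):
    # Toy equidistant fisheye model (an assumption for illustration):
    # the angle from the optical axis maps linearly to image radius.
    x, y, z = p
    r = np.hypot(x, y)
    theta = np.arctan2(r, z)
    scale = theta / r if r > 1e-9 else 0.0
    return np.array([x * scale, y * scale])

def unscented_project(mu, cov):
    # Push the sigma points through the nonlinear camera, then recover
    # the image-space mean and covariance from the transformed points.
    pts, w = sigma_points(mu, cov)
    proj = np.array([project_fisheye(p) for p in pts])
    mean = w @ proj
    diff = proj - mean
    cov2d = (w[:, None] * diff).T @ diff
    return mean, cov2d
```

Unlike the linearization used in standard splatting, nothing here requires the camera model to be differentiable or pinhole-like: swapping `project_fisheye` for a rolling-shutter or any other nonlinear mapping leaves the machinery unchanged.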
Michael: Because traditional Gaussian Splatting really needs that ideal pinhole sensor to work, and a lot of the camera sensors you find across various industries aren't like that.
Sanja: That's right. You need to be able to deal with nonlinear functions, nonlinear projections. In addition to that, we also handle rolling shutter, where the camera is scanning row by row or column by column; you can deal with that in the same way. And because it's rasterization-based, it can run on all GPUs. It's about as fast as Gaussian Splats. You get a little bit slower training time, maybe 20%, but you keep all the benefits of Gaussian Splats. At the same time, we aligned the representation with 3DGRT, so that even though you reconstruct with 3DGUT, you can still render it with 3DGRT if you have that additional RTX hardware.
Michael: It's pretty incredible to watch. When 3DGRT came out, it was like, okay, you can't get ray-traced effects in rasterization, and then a couple of months later it's possible. I also want to talk about the industrial applications of these radiance fields. It seems like one of the first places they're making an impact is autonomous vehicles. Why have radiance fields, across any of the different models, been such a useful representation there?
Sanja: One thing that's really important for robotics generally is scalability of simulation, and the ability to go from a real-world event to simulation. Then, when your software changes, you try again. Imagine my robot made a mistake in the real world; I can fix my software and test whether I actually fixed that mistake. Going from real to sim is a super important capability that wasn't possible before. When NeRF came out, that was the moment we said, okay, that's the killer use for this, to make it a reality. Because before, simulation for any of these robotics systems was synthetic: you had to have artists create all the content, and even if it was procedural, you had to render it, and the gap between that static world and the real world was significant. NeRF brings those two closer together. We really saw this as an opportunity to build closed-loop simulation capabilities, going from real to sim.
Michael: To follow up on that, there's no doubt that the more you can ingest, whether images, video, or other sensors, the more beneficial it is for robotics. So one thing I wanted to ask is: why is it important to build larger ingestion engines that understand different modalities? Take the Neural LiDAR Fields paper, for instance, where you ingest more modalities of data to get a more accurate simulation. Why is that so important to the world of simulation, and what challenges still exist in reconstruction?
Sanja: If you go just from video, everything you get is non-metric. And for robotics, a car that's moving has to be metric. You have to have that, and the geometry needs to be really good too, because you need to simulate a car with full physics actually driving. If there's a speed bump, you need to be able to simulate that effect. The higher the precision of the 3D geometry, the more accurate the interaction between the robot and the environment is going to be. LiDAR brings that additional 3D signal, which helps you reconstruct much better.

Just one more thing, because you also mentioned LiDAR fields: there are two different things here. One is using LiDAR to make NeRF, or neural reconstruction, better. But there's also simulating that other sensor for the robot. If you take L3, L4, L5 systems, they all have LiDAR. So if I want to build a simulator, it's no longer only important to simulate cameras; I also need to simulate all the other sensors I have.
Michael: Say you're simulating a place in London versus Manhattan versus Los Angeles. How does the actual location affect the simulation? For instance, driving on a specific side of the road, or the fact that people in Manhattan might be more aggressive drivers than those in Los Angeles?
Sanja: It doesn't affect neural reconstruction at all; that's just video footage. I'm going to get my 3D environment whether I'm in London, or Slovenia, where I'm from, or Toronto, or here in California. What changes is how you simulate traffic behavior, what we call traffic models: positioning all the cars and humans. Those need to be adapted to the location you're in. Either you have people scripting the logic, which was the old-style approach, or, the modern way, you collect some data and train these agents to behave based on that location. So it's really the traffic that changes. Another thing that needs to change is evaluation, because ultimately this simulation is used to test the car. Is it ready to go on the road or not? Is it making mistakes? You need some metrics to come out of the simulator, like how many times it breaks the rules, and for that I need to know whether it's left-hand or right-hand driving.
Michael: Speaking of Slovenia, which is where you grew up, I actually have a surprise related to that. While I was preparing for this chat, I was watching your talk at the Vector Institute from a couple of months ago, and you mentioned that you're a big fan of Luka Dončić, who is also from Slovenia. I don't know if you know this, but he's actually one of the few NBA players who has already been NeRFed. I have a bunch of NeRFs of Luka here that one of our friends captured. It was a combo of NeRF, Nerf guns, and Luka, with a bunch of different captures of him.
Sanja: Oh wow! I went to a game and I was like, “Luka! Luka!” That bullet one surprises me because dynamic reconstruction isn’t that good yet. I’m not sure how they did that. Maybe they captured the data and then inserted it in post, rendering novel views or something. That's a great surprise. Next time you can set up a chat with him.
Michael: We'll bring him to the Toronto office! We need to keep getting more reconstructions out there. Another paper I want to chat about is Gen3C, which is an incredible paper. It seems like it pairs really nicely with some of the other work in NVIDIA's ecosystem, like Cosmos, where you can generate very lifelike content, in this case 2D, and then bring it into a larger 3D environment. Could you talk a little bit about Gen3C?
Sanja: Gen3C in principle has the same goal as neural reconstruction: taking a capture, which could be a generated or a recorded video, and creating novel views, new renders. Only this time it's done with a generative model instead of overfitting a NeRF, and one wonders how far we can go with those technologies. The other thing is that neural reconstruction is fundamentally limited: you can only reconstruct things you see, things you capture. To reconstruct Luka Dončić, I need to move a camera all around him. In autonomous driving, you don't have that luxury. You're driving at high speed, in weather like fog and snow. It's hard, right? NeRF, or any of these neural reconstruction techniques, really struggles there, whether it's night or other adverse conditions. And they cannot outpaint; they cannot reconstruct things you didn't capture. These generative methods can hallucinate things in a very plausible way, while still retaining temporal consistency, the sense that this could be a place in the real world. The next Gaussian splat might just be something like this. Obviously, it's our first trial. It's currently slow and has a bunch of limitations, but potentially this is the way to go.
Michael: If I go to my phone's camera app right now and hit the shutter, it instantaneously captures that 2D moment. Do you think there will be a moment in the near-to-medium-term future where these 3D reconstructions happen instantaneously?
Sanja: You mean like this generation is real time?
Michael: Yeah, where it's pure raw 2D images, and as soon as a button is pressed, everything is reconstructed. Because right now we have the luxury of watching the capture shift and morph into this 3D approximation.
Sanja: There are these large reconstruction models that are going in this direction, toward real-time reconstruction. Yes, it's getting there, I would say. Obviously, it's not the same quality yet as when you actually run the full optimization.
Michael: I feel like we're in almost like the film photography era of Radiance Fields where it's like we don't necessarily know what it's going to look like until we've gone and processed it and we have to wait until it forms. But I feel like we're not terribly far away from that future where it's similar to how we interact with digital cameras now. Once we capture it, it's already reconstructed. It should be instantaneous, right?
Sanja: That should definitely be the future.
Michael: And I'm curious, too: how far away do you think we are from reconstructing the entire world in lifelike 3D? You have the NeRF-XL paper, for instance, which lets you train on an arbitrarily large number of images across an arbitrary number of GPUs. How far away are we from being able to reconstruct the entire planet, should that dataset exist?
Sanja: At least for autonomous driving, we need to move toward this. A typical Gaussian splat affords something like thirty seconds of simulation, meaning I can only test my car for about thirty seconds, right? And obviously, in the future, maybe I just want it to drive in a replica of San Francisco, with a bunch of little AV agents going around checking for corner cases. That's definitely where you want to move. You can do that trivially today, in some sense, by chunking: you take a very long capture and chunk it into thirty-second pieces, whatever you can handle. But then, at the boundaries between one chunk and another, you'll get a glitch.
Michael: What in this realm of large-scale reconstruction might be possible in the not-too-distant future?
Sanja: Yeah, with fVDB, NeRF-XL was kind of the first step. We really want to go large-scale: a multi-GPU approach that does the proper integral across a very long environment and lets you reconstruct a very long capture.
Michael: What's really cool about fVDB, too, is the acceleration structure compared to Nerfacc, which Ruilong Li built. It scales much more nicely, so you can create very scalable representations for some of these very large-scale, enterprise-grade tasks.
Sanja: VDB was originally invented for VFX, because it's a really, really efficient spatial index for very sparse volumetric data. It was invented by Ken Museth, who has won a bunch of Oscars, by the way.
Michael: And now he's like part of your lab as well, right?
Sanja: Who's better to bring machine learning to this awesome representation than him?
Michael: So why is it so important that we have a machine learning framework like fVDB for just the general world?
Sanja: You want to enable developers as much as you can. If your friends want to reconstruct a very large area, they shouldn't have to write some custom way of doing that. It should be a one-button click: a framework that just allows it. You give it three GPUs, five GPUs, whatever, and it does it for you, no matter the size of the environment. These frameworks are essentially trying to make development easier and scalable. I think there's going to be immense demand for this, because as it becomes apparent that we can now reconstruct areas this large, we'll really need a framework people can build on top of and develop those capabilities with; in a lot of ways, this is uncharted territory for people at large. And I'm very excited about it. But it's actually even more than reconstruction. Reconstruction is just one thing you want to do with the environment; you also want understanding. Imagine the autonomous vehicles we talked about: you want to map all of San Francisco, and without a dedicated framework, that can become really memory-intensive. So you want a framework where all of this is efficient, because there's a whole bunch of different tasks you want to do in 3D.
Michael: I'm very curious about Sovereign AI, and about translating Sovereign AI data into imaging, being able to reconstruct very large areas of some countries. I feel like that's going to be integral to some of these use cases. I also want to ask: which directions in the research world are you most excited about right now?
Sanja: We have two big projects. One is what we call the neural reconstruction engine, which is what's driving all of this research: building up this closed-loop simulation for robotics. We started with AV, and now we want to extend it to support robotics and other domains. There's a whole bunch of problems to solve there. Like we talked about, it needs to become more and more generative, because there's more and more you want to do. If I originally went straight through the intersection, maybe next time I want to go right or left; I don't want to be limited to one pathway. And I might want to do a lane change of a few meters, as opposed to two or three. Supporting that, like you're saying, at large scale, with this generative outpainting, is really something we need to push for. Related to that, we're also working on Cosmos. There's a bunch of things being released tomorrow on that front, which tackles it from a purely generative perspective: you give it prompts, which could be text or other kinds of prompts you'll see tomorrow in the keynote.
Michael: As a follow up to that, if you could solve just any one open problem in the world of either simulation or neural reconstruction, what would that be?
Sanja: I want to make Cosmos a real physics-based simulator. We need to keep pushing on that, and it needs to become extremely fast. We need to support very dense interaction with the robot. Right now, you can do a couple of autoregressive steps to generate a long video, but a robot is going to trigger an action, so you can't even do a second of this before you degrade your quality. There's still mileage to go; that's my focus, really making that good. Also, right now Cosmos doesn't give you any information about the forces applied to the robot. It's just a video; I can't interact with the world, only view it. We need physical touch points. Each of these leaps feels almost like science fiction, but we're now approaching the thresholds where it's becoming feasible to solve this, to pick things up, to interact, to extend beyond just visualization. So if I put my glasses on and NeRF or Cosmos is streaming video, I want to also get some tactile feedback out of it, whether for a robot or for me.
Michael: There are two papers, one called LERF, Language Embedded Radiance Fields, and one called GARField, Group Anything with Radiance Fields, that always seem to blow people's minds, because people assume the peak of this technology is very lifelike visualization. But in a lot of ways, that's the foundational point, a starting point. There's so much we can build on top of it, extend, interact with, and derive value from, and I think that's been very underexplored so far. But I'm very excited for this future, because it seems like it's already on its way, just kind of behind the scenes.
Sanja: These video models, or I guess the foundation models, need to meet reconstruction and be grounded in physics. How you do that, that's the next year or two.
Michael: Is there a technological moment or capability that you thought you would not see in your career? And if we haven't gotten there yet, what would that moment be for you?
Sanja: I don't know, this is going so fast, honestly. I guess the LLM was definitely a surprise. With ChatGPT, that moment surprised everyone, I think, with how quickly it came.
Michael: I remember the first couple prompts that I put into ChatGPT just to test how strong this actually is and then just being like, oh, no, it actually got this answer pretty accurately.
Sanja: We had one of the first sentence embeddings, you know, one of the first little LLMs, come out of our lab at U of T. I was a fresh faculty member, and I went to class and gave it to the students: we're going to generate a book. That was a class project. It was already producing some gibberish back then, but seven years later, for it to be this incredible, that was a surprise to me.
Michael: I remember as well that you mentioned in a previous lecture generative audio's capability to produce songs, and how that impacted people. People didn't know what to do with the fact that a computer could generate lyrics, vocals, and melody, and that all of that was possible.
Sanja: It's not like I didn't think it's going to happen in our lifetime, but it came much earlier.
Michael: Extending that to where we are right now: we now have the capability of translating imaging out of 2D into a very lifelike 3D world. So I'm curious, over the short to medium term, say three to seven years from now, do you think the way we as humans create content and interact with the world will shift into 3D?
Sanja: You mean like if it's going to be 3D versus just 2D?
Michael: For instance, having such a strong 3D approximation of a moment that if you want to revisit GTC 2025, instead of just looking at a 2D image of the banner outside, you can actually step into, explore, and interact with a very lifelike 3D version of that moment.
Sanja: This is what all this stuff is building towards. I would say seven years is a long time. So I think we're gonna see that earlier. Mass production and adoption, that's a different story, but the technology is going to come earlier than seven years from now.
Michael: I agree, and I think we're starting to see the first bridge to industry adoption of some of these lifelike 3D representations through different radiance fields right now. I'm also hopeful it will be sooner than seven years; I think it will be closer to three.
Sanja: And you think this is what people would like?
Michael: I think so.
Sanja: You talk to the audience all the time, right?
Michael: Yeah. I think that this allows people to really explore a given moment in time. And I think that this has just been the natural journey of imaging. It's taken us almost 200 years to get to where we are today, but I think that we're now unlocking the fundamental or foundational pieces to now extend into a very lifelike 3D future and I think it's just gonna get exponentially stronger from here. That's my hope. Well, I really appreciate your time today, Sanja. And yeah, thank you so much for sitting down with me today. I am looking forward to seeing the work that your lab continues to put out.
Sanja: Yeah, I'm happy to chat again. Thank you, Michael.