Human Gaussian Splats (HUGS) from the Apple Machine Learning team is designed to create lifelike, animatable human avatars within a dynamic scene, and it is groundbreaking in its efficiency and detail. It's over 100 times faster than prior approaches, and the resulting avatars can also be incorporated directly into NeRF-based scenes.
Like any good radiance field capture, it starts with the intake of frames. Because HUGS is dynamic, we'll be using video as the input. You don't need a lot of data to get started; Human Gaussian Splats actually suggests somewhere between 50 and 100 frames.
The first technological marvel comes into play once those original frames are captured. A pre-trained image-to-human pose and shape estimation model analyzes each frame to estimate the parameters of the Skinned Multi-Person Linear (SMPL) model. SMPL is a parametric body model that takes pose and body shape parameters as input and outputs a human body mesh, laying the groundwork for the avatar. A crucial step here is translating the observed human poses into a canonical pose similar to a Vitruvian pose.
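To make that step concrete, here is a minimal sketch using the open-source smplx Python package rather than Apple's code; the zeroed betas, body_pose, and global_orient tensors are stand-ins for the per-frame estimates the pose and shape model would produce.

```python
import torch
import smplx

# Load a neutral SMPL body model (requires the SMPL model files on disk).
body_model = smplx.create("models/", model_type="smpl", gender="neutral")

betas = torch.zeros(1, 10)         # body shape coefficients, shared across the video
body_pose = torch.zeros(1, 69)     # axis-angle rotations of the 23 body joints, per frame
global_orient = torch.zeros(1, 3)  # root orientation of the body

output = body_model(betas=betas, body_pose=body_pose,
                    global_orient=global_orient, return_verts=True)
vertices = output.vertices         # (1, 6890, 3) body mesh that seeds the human avatar
```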
Why that pose? In the realm of 3D modeling, it's akin to a blank canvas. It's a neutral stance with arms stretched out, forming a 'T.' This pose is pivotal because it serves as a universal starting point, a standard from which all movements and deviations can be accurately mapped and calibrated.
At this stage, Linear Blend Skinning (LBS) weights come into play, predicting how each Gaussian should move with the body and capturing nuanced details like clothing and hair that the bare body mesh misses, bringing the avatar closer to its real-life counterpart. Alongside the avatar, the static elements of the scene are also captured using their own set of 3D Gaussians, separate from the human model.
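For readers who want the mechanics, the sketch below shows plain Linear Blend Skinning applied to a set of Gaussian centers. The names and shapes are illustrative, not Apple's implementation; it simply demonstrates how blend weights tie points in the canonical pose to the skeleton's joint transforms.

```python
import torch

def lbs(points, weights, joint_transforms):
    """Pose canonical points with Linear Blend Skinning.

    points:           (N, 3) Gaussian centers in the canonical pose
    weights:          (N, J) per-point blend weights over J joints (rows sum to 1)
    joint_transforms: (J, 4, 4) rigid transform of each joint for the target pose
    returns:          (N, 3) posed centers
    """
    # Blend the per-joint transforms for each point: (N, 4, 4)
    blended = torch.einsum("nj,jab->nab", weights, joint_transforms)
    # Apply the blended transform in homogeneous coordinates
    homog = torch.cat([points, torch.ones(points.shape[0], 1)], dim=-1)  # (N, 4)
    posed = torch.einsum("nab,nb->na", blended, homog)
    return posed[:, :3]
```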
The 3D Gaussians, representing both the human and the scene, are then projected onto a 2D plane, akin to splattering paint onto a canvas, creating the final image. What makes Human Gaussian Splats truly remarkable is that this entire painting process is mathematically differentiable, ensuring that the learning and optimization of the model are precise and effective.
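As a rough illustration of that splattering step, the sketch below projects a single 3D Gaussian into image space using the standard 3D Gaussian Splatting (EWA) approximation: the covariance is pushed through the camera rotation and the Jacobian of the perspective projection. The variable names and the simple pinhole camera are assumptions for the example, not HUGS-specific code.

```python
import torch

def project_gaussian(mean_3d, cov_3d, world_to_cam, fx, fy):
    """mean_3d: (3,) center in world space; cov_3d: (3, 3) world-space covariance;
    world_to_cam: (4, 4) view matrix; fx, fy: focal lengths in pixels."""
    R, t = world_to_cam[:3, :3], world_to_cam[:3, 3]
    x, y, z = (R @ mean_3d + t).tolist()          # Gaussian center in camera space
    # Jacobian of the perspective projection evaluated at the Gaussian center
    J = torch.tensor([[fx / z, 0.0,    -fx * x / z ** 2],
                      [0.0,    fy / z, -fy * y / z ** 2]])
    cov_2d = J @ R @ cov_3d @ R.T @ J.T           # 2x2 image-space covariance
    mean_2d = torch.tensor([fx * x / z, fy * y / z])
    return mean_2d, cov_2d
```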
The model then enters an optimization phase, adjusting the Gaussian parameters by comparing the rendered image against the actual video frames. This phase involves a cocktail of technical metrics like L1 loss, SSIM loss, and perceptual loss, all working together to minimize discrepancies and refine the avatar's realism.
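Here is a hedged sketch of what that combined objective could look like in PyTorch. The loss weights are placeholders rather than the paper's values, and the SSIM and perceptual terms are passed in as callables (for example torchmetrics' SSIM metric and the lpips package) instead of reproducing the exact setup.

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(rendered, target, ssim_fn, lpips_fn,
                        w_l1=0.8, w_ssim=0.2, w_perc=0.01):
    """rendered, target: (B, 3, H, W) images in [0, 1]; weights are placeholders."""
    l1 = F.l1_loss(rendered, target)                 # pixel-wise L1
    ssim = 1.0 - ssim_fn(rendered, target)           # structural similarity, as a loss
    perc = lpips_fn(rendered, target).mean()         # perceptual (feature-space) distance
    return w_l1 * l1 + w_ssim * ssim + w_perc * perc
```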
Post-optimization, the Human Gaussian Splats can be directly manipulated, allowing the model to animate the avatar in new poses seamlessly. Finally, the human and scene Gaussians are put back together for the final rendering, enabling the creation of dynamic, lifelike scenes from various perspectives.
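Conceptually, that composition is simple: the optimized human and scene Gaussians carry the same attributes, so rendering a frame amounts to posing the human set (for instance with an LBS function like the one sketched earlier) and concatenating the two sets before splatting. The dictionary layout below is purely illustrative.

```python
import torch

def compose(scene, human, joint_transforms, lbs_fn):
    """scene / human: dicts of per-Gaussian tensors; joint_transforms: (J, 4, 4)."""
    posed_centers = lbs_fn(human["centers"], human["lbs_weights"], joint_transforms)
    return {
        "centers":   torch.cat([scene["centers"],   posed_centers],      dim=0),
        "colors":    torch.cat([scene["colors"],    human["colors"]],    dim=0),
        "opacities": torch.cat([scene["opacities"], human["opacities"]], dim=0),
    }
```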
The astonishing aspect is the speed of this entire process, which takes only about 30 minutes. You're probably thinking, perhaps 30 minutes on an H200? Well, this was actually trained on a 3090 Ti.
As of now, HUGS doesn't model environment lighting, and it struggles somewhat with loose clothing, such as dresses. HUGS unfortunately might not be making an appearance at the Met Gala, but that's roughly six months from now and at the pace we're going who knows what will be possible.
The Github page has not launched yet, but will be linked here as it becomes available. Because this is coming from Apple, I wouldn't get too excited about what will be accessible, but there is an upside to that. We are continuing to march ever closer to a potential Vision Pro release date, and one can only imagine how Human Gaussian Splats might fit into a world filled with Spatial Videos.