TriplaneGaussian joins a steadily growing group of platforms enabling fast 3D reconstruction from a single image. It comes out of VAST AI Research and was previously only available through a Gradio demo in a Hugging Face Space, where it rose quickly, appearing as one of the Spaces of the Week and on Hugging Face's trending page. Several preloaded examples are available to try, or you can let your imagination run free and see what you can generate.
Originally, the team tried to have their method directly predict 3D Gaussians, but ran into trouble when generating models from a single image. With that in mind, they pivoted to a hybrid representation that combines the benefits of triplane and point cloud approaches.
To power it, they use two transformer networks: a point cloud decoder and a triplane decoder. The point cloud decoder produces a rough approximation of the object's geometry, and that rough geometry is then enriched with local image features through projection-aware conditioning. This step is critical: it yields a high-quality point cloud that faithfully represents the original input image.
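To make the idea concrete, here is a minimal PyTorch sketch of what projection-aware conditioning could look like: each 3D point is projected into the input view using the camera parameters, and a local image feature is bilinearly sampled at the resulting pixel. The function name, tensor shapes, and camera conventions here are assumptions for illustration, not the repository's actual API.

```python
import torch
import torch.nn.functional as F

def projection_aware_conditioning(points, image_feats, K, RT):
    """Sample a local image feature for each 3D point by projecting it
    into the input view.
    points:      (B, N, 3) world-space points
    image_feats: (B, C, H, W) image feature map (assumed square)
    K:           (B, 3, 3) camera intrinsics
    RT:          (B, 3, 4) camera extrinsics
    """
    B, N, _ = points.shape
    # Transform world-space points into camera space: x_cam = [R|t] @ x.
    homog = torch.cat([points, torch.ones(B, N, 1, device=points.device)], dim=-1)
    cam = torch.einsum('bij,bnj->bni', RT, homog)               # (B, N, 3)
    # Perspective projection to pixel coordinates.
    pix = torch.einsum('bij,bnj->bni', K, cam)
    pix = pix[..., :2] / pix[..., 2:].clamp(min=1e-6)           # (B, N, 2)
    # Normalize to [-1, 1] for grid_sample.
    S = image_feats.shape[-1]
    grid = ((pix / (S - 1)) * 2 - 1).view(B, N, 1, 2)
    # Bilinearly sample one feature vector per point.
    sampled = F.grid_sample(image_feats, grid, align_corners=True)  # (B, C, N, 1)
    return sampled.squeeze(-1).transpose(1, 2)                  # (B, N, C)
```

The sampled per-point features are what upgrade the coarse geometry into a point cloud that stays faithful to the input image.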
TriplaneGaussian then passes this point cloud to the triplane decoder.
This decoder employs a deep, ten-layer structure, enabling it to extract more nuanced features from the data. It works by analyzing the positional relationships within the image data, effectively learning how different segments correlate to specific 3D coordinates.
A key element of this process is the integration of point cloud data into the model. By encoding this data into the system’s learnable positional embeddings, the model achieves a heightened level of geometric awareness. This means it can better understand the shapes and contours of the 3D space it's representing.
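Below is a rough sketch of how such a decoder might be wired up, assuming learnable triplane tokens that are made geometry-aware by adding point-cloud-derived features before a ten-layer transformer refines them against the image features. The class and argument names are hypothetical.

```python
import torch
import torch.nn as nn

class TriplaneDecoder(nn.Module):
    """Illustrative sketch: learnable triplane tokens, made geometry-aware
    by adding point-cloud features, refined by a transformer stack. The
    layer count mirrors the article; everything else is assumed."""
    def __init__(self, dim=512, n_layers=10, plane_res=32):
        super().__init__()
        n_tokens = 3 * plane_res * plane_res    # three axis-aligned planes
        # One learnable positional embedding per triplane token.
        self.pos_emb = nn.Parameter(torch.randn(1, n_tokens, dim) * 0.02)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8,
                                           batch_first=True)
        self.layers = nn.TransformerDecoder(layer, num_layers=n_layers)

    def forward(self, point_tokens, image_tokens):
        # point_tokens: (B, n_tokens, dim) point features already pooled
        # onto the triplane grid; image_tokens: (B, T, dim) image features.
        # Adding point features to the positional embeddings is what gives
        # the decoder its explicit geometric awareness.
        queries = self.pos_emb + point_tokens
        return self.layers(queries, image_tokens)   # (B, n_tokens, dim)
```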
To enhance this spatial understanding, the point cloud data is augmented with projection features derived from the input images. The model's output is further refined by PointNet, a neural network specifically tailored for processing point cloud data, combined with local pooling techniques. Local pooling helps distill the vast amount of data into more manageable, yet still meaningful, representations.
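A compact sketch of the PointNet-plus-local-pooling idea: a shared MLP lifts each point (coordinates plus its sampled image features) to a feature vector, and features are max-pooled within voxel cells rather than globally, so local geometric detail survives. Points are assumed normalized to [-1, 1]; all names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class PointNetLocalPool(nn.Module):
    """Sketch of a PointNet-style encoder with local (per-voxel) pooling."""
    def __init__(self, in_dim=3 + 64, dim=256, grid_res=16):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, dim), nn.ReLU(),
                                 nn.Linear(dim, dim))
        self.grid_res = grid_res

    def forward(self, points, feats):
        # points: (N, 3) in [-1, 1]; feats: (N, F) per-point image features.
        x = self.mlp(torch.cat([points, feats], dim=-1))         # (N, dim)
        # Assign each point to a voxel cell.
        idx = ((points + 1) / 2 * (self.grid_res - 1)).long()
        idx = idx.clamp(0, self.grid_res - 1)
        cell = (idx[:, 0] * self.grid_res + idx[:, 1]) * self.grid_res + idx[:, 2]
        # Max-pool features per cell; empty cells stay zero.
        pooled = torch.zeros(self.grid_res ** 3, x.shape[1])
        pooled = pooled.scatter_reduce(0, cell[:, None].expand_as(x), x,
                                       reduce='amax', include_self=False)
        return pooled                                            # (G^3, dim)
```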
The final, and perhaps most intriguing, step involves an orthographic projection of these features onto three axis-aligned planes. Orthographic projection, a method of displaying 3D objects in two dimensions, is employed here to align the 3D data with the respective X, Y, and Z axes. This alignment is critical for maintaining the integrity of the 3D structure in a 2D framework.
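The projection itself is simple enough to show directly: each plane's 2D coordinates are obtained by dropping the coordinate perpendicular to that plane. A minimal sketch:

```python
import torch

def orthographic_project(points):
    """Project 3D points orthographically onto the three axis-aligned
    planes by dropping one coordinate per plane.
    points: (N, 3) tensor of xyz coordinates.
    Returns (N, 2) in-plane coordinates for the XY, XZ, and YZ planes."""
    xy = points[:, [0, 1]]   # discard z: viewing along the Z axis
    xz = points[:, [0, 2]]   # discard y: viewing along the Y axis
    yz = points[:, [1, 2]]   # discard x: viewing along the X axis
    return xy, xz, yz
```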
Once projected, the features that land on the same plane are pooled together and enhanced with the model's learnable positional embeddings. This step aligns the detailed image features with the structured point cloud data, resulting in a highly accurate 3D representation.
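As a sketch, pooling on a single plane might look like the following: projected points are binned into grid cells, features are averaged per cell, and the plane's learnable positional embeddings are added. The helper name and shapes are hypothetical.

```python
import torch

def pool_to_plane(coords2d, feats, emb, res=32):
    """Average-pool per-point features into a 2D grid on one plane, then
    add that plane's learnable positional embeddings.
    coords2d: (N, 2) in [-1, 1]; feats: (N, C); emb: (res*res, C)."""
    idx = ((coords2d + 1) / 2 * (res - 1)).long().clamp(0, res - 1)
    cell = idx[:, 0] * res + idx[:, 1]                       # flat cell index
    summed = torch.zeros(res * res, feats.shape[1])
    summed.index_add_(0, cell, feats)                        # sum per cell
    counts = torch.zeros(res * res).index_add_(0, cell, torch.ones(len(cell)))
    mean = summed / counts.clamp(min=1).unsqueeze(1)         # mean per cell
    return mean + emb                                        # add embeddings
```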
It's a lot of fun to play around with, and if you download the 3D generation, it will be a .splat file, meaning you have quite a few options for what to do with it. The download instructions are on their GitHub, or you can experiment directly on Hugging Face. At the time of posting, there is no licensing information.
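If you want to poke at a downloaded file programmatically, here is a rough NumPy sketch of parsing a .splat file. It assumes the unofficial 32-byte-per-Gaussian layout used by antimatter15's popular web viewer (float32 position and scale, uint8 RGBA color and quaternion); since the format isn't standardized, verify it against the file you actually receive.

```python
import numpy as np

def read_splat(path):
    """Parse a .splat file, assuming 32 bytes per Gaussian:
    12 bytes position (3x float32), 12 bytes scale (3x float32),
    4 bytes RGBA color (uint8), 4 bytes rotation quaternion (uint8)."""
    raw = np.fromfile(path, dtype=np.uint8).reshape(-1, 32)
    positions = raw[:, 0:12].copy().view(np.float32)         # (N, 3)
    scales = raw[:, 12:24].copy().view(np.float32)           # (N, 3)
    colors = raw[:, 24:28]                                   # RGBA, 0-255
    rotations = (raw[:, 28:32].astype(np.float32) - 128) / 128  # quaternion
    return positions, scales, colors, rotations
```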