Upload an image to the Hugging Face Space demo and download the resulting textured GLB mesh.
Run inference.py locally on an image to produce a 3D model with PBR textures.
Retrain the three-stage cascade on ObjaverseXL data with the included data toolkit.
Switch between the main branch (Trellis.2 backbone) and paper branch (Direct3D-S2) to reproduce SIGGRAPH submission numbers.
Requires Trellis.2 setup, a CUDA-matched natten build, and a custom utils3d wheel before inference will run.
Pixal3D is a research project from Tencent ARC Lab and Tsinghua University that turns a single image into a textured 3D model. The paper has been accepted to SIGGRAPH 2026. The headline idea is that earlier image-to-3D systems passed image features into a 3D network only loosely, through attention layers, while Pixal3D lifts each pixel into 3D space by back projection. This gives the network a direct correspondence between what is in the 2D image and where it should sit in the 3D volume, which the authors say produces geometry and PBR textures close to the quality of a full 3D reconstruction.

There is a hosted demo on Hugging Face Spaces where you can upload an image in a browser and download the resulting GLB mesh without installing anything locally. The repository ships two branches: main, the improved version built on the Trellis.2 backbone, and paper, the original implementation on Direct3D-S2 used to produce the numbers in the SIGGRAPH submission.

Local installation starts with the Trellis.2 setup, followed by the project's own requirements, a natten build matched to your CUDA architecture, and a small utils3d wheel. Inference is run through inference.py with an image path and an output GLB path. A low_vram flag drops the default resolution from 1536 to 1024 and loads model components on demand, and setting ATTN_BACKEND=sdpa lets you skip flash_attn if it is not installed. There is also a Gradio web demo launched via app.py.

For people who want to retrain the model, the training code is included and organised as a three-stage cascade. Stage 1 trains a sparse structure model at 32 then 64 voxel resolution, stage 2 trains a shape model going from 256 up to 1024, and stage 3 trains a texture model on the same resolution ladder. Each stage uses pixel-aligned projection conditioning and two view-aligned latents by default.
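
As a rough illustration of the back projection idea described above, the sketch below lifts a pixel to a 3D point with a standard pinhole camera model and maps that point to a voxel cell. It is not taken from the Pixal3D code. The per-pixel depth input, the intrinsics (fx, fy, cx, cy), the scene bounds, and the function names back_project and voxel_index are all assumptions made for illustration, and the summary does not say how Pixal3D itself places pixels in the volume.

```python
import math


def back_project(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with a known depth to a camera-space 3D point.

    Generic pinhole model, assumed here for illustration only.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def voxel_index(point, lo, hi, resolution):
    """Map a 3D point inside the cube [lo, hi]^3 to integer voxel coordinates."""
    index = []
    for coord in point:
        cell = math.floor((coord - lo) / (hi - lo) * resolution)
        # Clamp so points on or beyond the boundary stay inside the grid.
        index.append(min(max(cell, 0), resolution - 1))
    return tuple(index)


# Hypothetical numbers: a 4x4 image with its principal point at the centre.
corner = back_project(u=0, v=0, depth=2.0, fx=4.0, fy=4.0, cx=2.0, cy=2.0)
cell = voxel_index(corner, lo=-2.0, hi=2.0, resolution=64)
print(corner, cell)
```

With these made-up values the top-left pixel lands at (-1.0, -1.0, 2.0) and falls into voxel (16, 16, 63). The point is only that each pixel gets a definite location in the volume, so its features can be attached there rather than routed through attention.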
A separate data toolkit prepares O-Voxel data and rendered condition images from a source such as ObjaverseXL. Each higher-resolution step is launched by pointing its config's finetune_ckpt at the checkpoint produced by the previous step. The repository is released under the MIT license.
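
The checkpoint chaining between resolution steps can be pictured with a small sketch. Only the finetune_ckpt key and the 32 then 64 ladder for stage 1 come from the description above. The Step class, the build_chain helper, the stage label, and the checkpoint path pattern are hypothetical, and the real config format should be checked in the repository.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    stage: str
    resolution: int
    finetune_ckpt: Optional[str]  # None means this step trains from scratch


def build_chain(stage: str, resolutions: List[int], ckpt_dir: str = "ckpts") -> List[Step]:
    """Give each step the checkpoint written by the step before it."""
    steps = []
    previous_ckpt = None
    for resolution in resolutions:
        steps.append(Step(stage, resolution, previous_ckpt))
        # Hypothetical naming scheme for the checkpoint this step would produce.
        previous_ckpt = f"{ckpt_dir}/{stage}_{resolution}.pt"
    return steps


# Stage 1 as described: sparse structure at 32, then 64.
chain = build_chain("sparse_structure", [32, 64])
for step in chain:
    print(step.stage, step.resolution, step.finetune_ckpt)
```

In this sketch the 32 step starts with no checkpoint and the 64 step points at the file the 32 step would have written. The same pattern would apply to the shape and texture stages on their own resolution ladders.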
Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.