This repository packages a research model called VGGT-Omega as a plug-in for FiftyOne, an open-source tool for managing computer-vision datasets. VGGT-Omega was published at CVPR 2026 by Meta AI and Oxford VGG. Given a video, it estimates a depth map for every frame and merges them into a single 3D point cloud of the filmed scene, all in one forward pass through the network. The README points out that this replaces older multi-step pipelines built on iterative refinement or Structure-from-Motion.
Each video in your dataset gets two outputs. Per-frame depth maps are stored under sample.frames[i]["depth_map"] as FiftyOne Heatmaps that you can overlay in the FiftyOne App, and the merged 3D scene is stored under sample["scene_3d"] as a path to a .fo3d file that opens in FiftyOne's built-in 3D viewer.
To use it, you pip install the model code from Meta's facebookresearch/vggt-omega repo plus a handful of dependencies, including fiftyone, open3d, einops, safetensors, huggingface_hub, and opencv-python. You then register this GitHub repo as a zoo source and call foz.load_zoo_model to load the facebook/VGGT-Omega-1B-512 checkpoint with parameters such as confidence_threshold, video_sample_fps, max_frames, preprocessing_mode, and image_resolution. The README gives concrete A100 memory benchmarks for the max_frames setting, ranging from about 7GB at 16 frames to about 21GB at 200 frames. Inference itself is a call to dataset.apply_model(model, "depth_map"), run after compute_metadata() so the loader knows each video's frame rate.
The README then walks through building a grouped dataset that places the depth overlays and the merged 3D point cloud side by side, so when you launch the FiftyOne App you can switch between a video slice (depth heatmaps over each frame) and a threed slice (the merged scene in the 3D viewer).
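
The register, load, and apply steps described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the README's exact code. The zoo source URL is left as an argument because this summary does not give it, the parameter values are placeholders rather than recommended settings, and preprocessing_mode is omitted because its accepted values are not listed here. The helper names build_model_kwargs and run_depth_inference are hypothetical. The FiftyOne calls used (register_zoo_model_source, load_zoo_model, compute_metadata, apply_model) are part of the standard FiftyOne API.

```python
def build_model_kwargs(max_frames=16, video_sample_fps=1.0,
                       confidence_threshold=0.5, image_resolution=512):
    """Collect loader parameters named in the README.

    All default values here are placeholders chosen for illustration;
    check the README for the real defaults and accepted ranges.
    """
    return {
        "max_frames": max_frames,
        "video_sample_fps": video_sample_fps,
        "confidence_threshold": confidence_threshold,
        "image_resolution": image_resolution,
    }


def run_depth_inference(dataset, zoo_source_url, **overrides):
    """Register the zoo source, load the checkpoint, and run inference.

    `dataset` is assumed to be a FiftyOne video dataset, and
    `zoo_source_url` is the URL of this GitHub repo (not given here).
    """
    # Imported lazily so the helper above can be used without FiftyOne installed.
    import fiftyone.zoo as foz

    foz.register_zoo_model_source(zoo_source_url)
    model = foz.load_zoo_model(
        "facebook/VGGT-Omega-1B-512",
        **build_model_kwargs(**overrides),
    )

    # Metadata must exist first so the loader knows each video's frame rate.
    dataset.compute_metadata()
    dataset.apply_model(model, "depth_map")
    return dataset
```

After a call such as run_depth_inference(dataset, zoo_source_url, max_frames=32), the per-frame heatmaps and the scene_3d path should be available on each sample as described above. Keep in mind that larger max_frames values raise GPU memory use, per the README's A100 figures.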
A second checkpoint, VGGT-Omega-1B-256-Text, also produces a 2048-dimensional scene-level embedding alongside the depth output, which the README shows being indexed with fiftyone.brain.compute_similarity for nearest-neighbour scene search. The repo ends with a BibTeX citation for the underlying CVPR 2026 paper.
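
To make the idea of nearest-neighbour scene search concrete, here is a small pure-Python illustration. It is not how fiftyone.brain.compute_similarity is implemented; it only shows the underlying operation of ranking scenes by embedding similarity. The 4-dimensional toy vectors stand in for the 2048-dimensional embeddings, the scene names are made up, and cosine similarity is assumed as the metric because it is a common choice (the metric the README configures is not stated here).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_scenes(query, embeddings, k=2):
    """Return the names of the k scenes most similar to the query vector."""
    scored = [(name, cosine_similarity(query, vec))
              for name, vec in embeddings.items()]
    # Highest similarity first.
    scored.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in scored[:k]]


# Toy stand-ins for scene-level embeddings (hypothetical data).
toy_embeddings = {
    "kitchen_a": [0.9, 0.1, 0.0, 0.1],
    "kitchen_b": [0.8, 0.2, 0.1, 0.0],
    "street": [0.0, 0.1, 0.9, 0.3],
}

result = nearest_scenes([1.0, 0.1, 0.0, 0.0], toy_embeddings, k=2)
print(result)
```

In practice the similarity index built by FiftyOne handles storage and querying for you; the sketch is only meant to show what "nearest neighbour" means for scene embeddings.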
Generated 2026-05-21 · Model: sonnet-4-6 · Verify against the repo before relying on details.