Train a diffusion policy for visual robot navigation on RECON or SCAND
Fine-tune a goal-conditioned waypoint predictor on a custom dataset
Reproduce a V-JEPA2 plus diffusion baseline against visualnav-transformer
Convert ROS bag navigation data into the vint_train folder layout
Needs a CUDA GPU, a conda environment, a git submodule for diffusion_policy, and dataset preprocessing, including the rosbag Python bindings for non-RECON sets.
DiffusionNav is research code for teaching a wheeled robot where to drive next based only on what its front camera sees. The system takes a short clip of recent camera frames and, optionally, a single goal image showing where the robot should end up, and predicts a sequence of future waypoints: the xy positions the robot should move through.
It does this with two main pieces. The first is V-JEPA2, a vision model that the authors load from PyTorch Hub and keep frozen, with no further training; it turns the raw frames and the goal image into a compact feature representation. The second is a diffusion policy, a small one-dimensional U-Net that starts from random noise and gradually refines it into a clean waypoint trajectory, conditioned on the visual features. A fusion transformer sits between the two, alongside a goal-embedding path, and classifier-free guidance lets the model run with or without the goal at inference time.
The project is heavily inspired by the earlier visualnav-transformer line of work (NoMaD, ViNT, GNM). Several of the training utilities, dataset formats, and configuration files are reused as-is from that codebase and kept in a directory called vint_train so that existing data layouts still work. The diffusion policy library itself is pulled in as a git submodule from the Stanford diffusion_policy repo.
The README walks through setup: clone the repository with submodules, create a conda environment from the provided yaml file, and pip install the diffusion policy package.
The data side supports several public navigation datasets, including RECON, TartanDrive, SCAND, GoStanford2, and SACSoN. A preprocessing script for RECON converts the original HDF5 release into a per-trajectory folder layout with numbered jpg frames and a pickle of robot positions and yaw values.
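To make the per-trajectory layout concrete, here is a minimal sketch that writes and reads one trajectory folder. The file name traj_data.pkl, the dictionary keys position and yaw, and the 0.jpg, 1.jpg, ... naming are assumptions modeled on the visualnav-transformer convention and should be checked against the repo. The helper names write_trajectory and read_trajectory are hypothetical, and the frames are empty placeholder files rather than real JPEG data.

```python
import pickle
import tempfile
from pathlib import Path


def write_trajectory(root, name, positions, yaws, frames):
    """Write one trajectory folder: numbered frames plus a pose pickle.

    Assumed layout (verify against the repo):
        <root>/<name>/0.jpg, 1.jpg, ...
        <root>/<name>/traj_data.pkl  ->  {"position": [...], "yaw": [...]}
    """
    if not (len(positions) == len(yaws) == len(frames)):
        raise ValueError("need exactly one position and one yaw per frame")
    traj_dir = Path(root) / name
    traj_dir.mkdir(parents=True, exist_ok=True)
    for i, jpg_bytes in enumerate(frames):
        (traj_dir / f"{i}.jpg").write_bytes(jpg_bytes)
    with open(traj_dir / "traj_data.pkl", "wb") as f:
        pickle.dump({"position": positions, "yaw": yaws}, f)
    return traj_dir


def read_trajectory(traj_dir):
    """Return (frame paths in temporal order, pose dictionary)."""
    traj_dir = Path(traj_dir)
    with open(traj_dir / "traj_data.pkl", "rb") as f:
        data = pickle.load(f)
    # Sort numerically: a plain string sort would put 10.jpg before 2.jpg.
    frame_paths = sorted(traj_dir.glob("*.jpg"), key=lambda p: int(p.stem))
    return frame_paths, data


# Tiny demo with three placeholder frames in a temporary directory.
demo_root = tempfile.mkdtemp()
demo_dir = write_trajectory(
    demo_root,
    "traj_000",
    positions=[(0.0, 0.0), (0.5, 0.0), (1.0, 0.1)],  # xy positions
    yaws=[0.0, 0.0, 0.2],                            # headings in radians
    frames=[b"", b"", b""],                          # placeholder bytes
)
frame_paths, data = read_trajectory(demo_dir)
```

Keeping one pose entry per frame index is what lets a loader pair frame i with position i and yaw i without any extra timestamp matching.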
ROS bag files for the other datasets can be converted with a generic script that needs the rosbag Python bindings. Once the data is laid out, a script splits it into train and test partitions, a yaml config points at the processed folders, and a single Python command starts training. Inference scripts then run the trained model on a clip or a batch of clips and render the predicted waypoints onto a video. The repo also includes a smoke test and a fake data generator for quick checks.
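At inference time, classifier-free guidance blends a goal-conditioned noise prediction with an unconditioned one, which is why the same model can run with or without a goal image. The toy sampler below illustrates the idea under stated assumptions and is not the repo's implementation: toy_noise_model is a hypothetical stand-in for the 1D U-Net, the update is a simplified fixed-step refinement rather than a real DDPM or DDIM scheduler, and guidance_scale is an illustrative parameter name. It uses the convention where a scale of 0 is fully unconditional and 1 is fully conditional.

```python
import random


def toy_noise_model(waypoints, goal=None):
    """Hypothetical stand-in for the denoising U-Net.

    Returns, per waypoint, the offset from a target: a straight line to
    the goal when a goal is given, or the origin when it is not.
    """
    n = len(waypoints)
    eps = []
    for i, (x, y) in enumerate(waypoints):
        if goal is None:
            tx, ty = 0.0, 0.0
        else:
            frac = (i + 1) / n
            tx, ty = goal[0] * frac, goal[1] * frac
        eps.append((x - tx, y - ty))
    return eps


def guided_noise(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: eps_u + scale * (eps_c - eps_u)."""
    return [
        (u[0] + guidance_scale * (c[0] - u[0]),
         u[1] + guidance_scale * (c[1] - u[1]))
        for u, c in zip(eps_uncond, eps_cond)
    ]


def sample_waypoints(goal, horizon=4, steps=50, step_size=0.2,
                     guidance_scale=1.0, seed=0):
    """Start from Gaussian noise and iteratively refine xy waypoints."""
    rng = random.Random(seed)
    waypoints = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(horizon)]
    for _ in range(steps):
        eps_u = toy_noise_model(waypoints, goal=None)   # goal dropped
        eps_c = toy_noise_model(waypoints, goal=goal)   # goal provided
        eps = guided_noise(eps_u, eps_c, guidance_scale)
        waypoints = [
            (x - step_size * ex, y - step_size * ey)
            for (x, y), (ex, ey) in zip(waypoints, eps)
        ]
    return waypoints


with_goal = sample_waypoints(goal=(4.0, 2.0), guidance_scale=1.0)
without_goal = sample_waypoints(goal=(4.0, 2.0), guidance_scale=0.0)
```

In this toy setup, a scale of 1 pulls the last waypoint onto the goal, a scale of 0 ignores the goal entirely, and scales above 1 extrapolate past the conditional prediction. In general, this kind of guidance relies on the model having been trained with the condition randomly dropped, so that both predictions come from one network.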
Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.