explaingit

adipotnis/diffusionnav

0 Python Audience · researcher Complexity · 5/5 Active Setup · hard

TLDR

Research code that teaches a wheeled robot to predict future waypoints from front camera frames using a frozen V-JEPA2 vision model and a diffusion policy.

Mindmap

mindmap
  root((DiffusionNav))
    Inputs
      Recent camera frames
      Optional goal image
      Public nav datasets
    Outputs
      Predicted xy waypoints
      Rendered trajectory video
    Use Cases
      Visual robot navigation
      Goal conditioned policy training
      Diffusion policy research
    Tech Stack
      Python
      PyTorch
      V-JEPA2
      Diffusion Policy
      Conda

Things people build with this

USE CASE 1

Train a diffusion policy for visual robot navigation on RECON or SCAND

USE CASE 2

Fine-tune a goal-conditioned waypoint predictor on a custom dataset

USE CASE 3

Reproduce a V-JEPA2 plus diffusion baseline against visualnav-transformer

USE CASE 4

Convert ROS bag navigation data into the vint_train folder layout

Tech stack

Python · PyTorch · Diffusion · Conda · ROS

Getting it running

Difficulty · hard Time to first run · 1 day+

Needs a CUDA GPU, a conda environment, the diffusion_policy git submodule, and dataset preprocessing, which requires the rosbag Python bindings for the non-RECON datasets.
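
Before starting a long setup, a quick preflight check can show which of these prerequisites are already present. The sketch below is not part of the repo; the module names (`torch`, `diffusion_policy`, `rosbag`) are assumptions about what the environment installs, and looking for `nvidia-smi` on the PATH is only a rough proxy for a usable CUDA GPU.

```python
import importlib.util
import shutil

# Assumed module names for illustration; adjust to what the repo's
# environment file actually installs.
REQUIRED_MODULES = ["torch", "diffusion_policy", "rosbag"]


def find_missing(modules):
    """Return the module names that cannot be located on the import path."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


def preflight(modules=REQUIRED_MODULES):
    """Collect human-readable problems without importing heavy packages."""
    problems = [f"missing Python module: {name}" for name in find_missing(modules)]
    # Rough proxy only: the driver tool being on PATH does not prove CUDA works.
    if shutil.which("nvidia-smi") is None:
        problems.append("nvidia-smi not found on PATH")
    return problems


if __name__ == "__main__":
    for line in preflight() or ["all checks passed"]:
        print(line)
```

Run it inside the activated conda environment. The rosbag entry only matters if you plan to convert ROS bag datasets.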

In plain English

DiffusionNav is research code for teaching a wheeled robot where to drive next based only on what its front camera sees. The system takes a short clip of recent camera frames and, optionally, a single goal image showing where the robot should end up, and it predicts a sequence of future waypoints, meaning xy positions the robot should move through.

It does this with two main pieces. The first is V-JEPA2, a vision model the authors load from PyTorch Hub and keep frozen, with no further training. V-JEPA2 turns the raw frames and the goal image into a compact feature representation. The second piece is a diffusion policy: a small one-dimensional U-Net that starts from random noise and gradually refines it into a clean waypoint trajectory, conditioned on the visual features. A fusion transformer sits between the two, alongside a goal embedding path and a classifier-free guidance trick that lets the model run with or without the goal at inference time.

The project is heavily inspired by an earlier line of work called visualnav-transformer (NoMaD, ViNT, GNM). Several of the training utilities, dataset formats, and configuration files are reused as-is from that codebase and kept in a directory called vint_train so that existing data layouts still work. The diffusion policy library itself is pulled in as a git submodule from the Stanford diffusion_policy repo.

The README walks through setup. You clone the repository with submodules, create a conda environment from the provided yaml file, and pip install the diffusion policy package.

The data side supports several public navigation datasets, including RECON, TartanDrive, SCAND, GoStanford2, and SACSoN. A preprocessing script for RECON converts the original HDF5 release into a per-trajectory folder layout with numbered jpg frames and a pickle of robot positions and yaw values.
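
To make the per-trajectory layout concrete, here is a minimal sketch of a loader that checks one trajectory folder for consistency. The pickle file name `traj_data.pkl` and the keys `position` and `yaw` are assumptions for illustration, and `load_trajectory` is a hypothetical helper, not a function from the repo; check the preprocessing script for the real names.

```python
import pickle
from pathlib import Path


def load_trajectory(traj_dir, pickle_name="traj_data.pkl"):
    """Return (sorted frame paths, pose dict) for one trajectory folder.

    Assumed layout (hypothetical names): 0.jpg, 1.jpg, ... plus one pickle
    holding a dict with "position" and "yaw" sequences of equal length.
    """
    traj_dir = Path(traj_dir)
    # Sort numerically so 10.jpg comes after 9.jpg, not after 1.jpg.
    frames = sorted(
        (p for p in traj_dir.glob("*.jpg") if p.stem.isdigit()),
        key=lambda p: int(p.stem),
    )
    # Only unpickle files you produced yourself; pickle can execute code.
    with open(traj_dir / pickle_name, "rb") as f:
        poses = pickle.load(f)
    n_pos, n_yaw = len(poses["position"]), len(poses["yaw"])
    if not (len(frames) == n_pos == n_yaw):
        raise ValueError(
            f"{traj_dir.name}: {len(frames)} frames, "
            f"{n_pos} positions, {n_yaw} yaw values"
        )
    return frames, poses
```

A check like this is useful right after preprocessing, because a mismatch between frame count and pose count tends to surface much later as a confusing indexing error during training.
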
ROS bag files for the other datasets can be converted with a generic script that needs the rosbag Python bindings. Once the data is laid out, a script splits it into train and test partitions, a yaml config points at the processed folders, and a single python command starts training. Inference scripts then run the trained model on a clip or a batch of clips and render the predicted waypoints onto a video. The repo also includes a smoke test and a fake data generator for quick checks.
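
The classifier-free guidance idea mentioned above can be sketched in a few lines. This is the standard formulation rather than the repo's actual code: the function names, the zero vector used as the "no goal" embedding, and the guidance scale are all assumptions for illustration, and plain Python lists stand in for tensors.

```python
import random


def maybe_drop_goal(goal_embedding, drop_prob, rng=random):
    """Training-time goal dropout.

    With probability drop_prob, replace the goal embedding with a null
    embedding (assumed here to be all zeros) so one network learns both
    the goal-conditioned and the goal-free prediction.
    """
    if rng.random() < drop_prob:
        return [0.0] * len(goal_embedding)
    return list(goal_embedding)


def guided_noise(eps_uncond, eps_cond, guidance_scale):
    """Inference-time combination of two noise predictions.

    scale 0 gives the goal-free prediction, scale 1 the goal-conditioned
    one, and scale > 1 pushes further in the direction of the goal.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```

At each denoising step the U-Net would be evaluated twice, once with the goal embedding and once with the null embedding, and the combined estimate fed to the sampler. Running with no goal at all corresponds to using only the unconditional prediction.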

Copy-paste prompts

Prompt 1
Set up DiffusionNav with conda and the diffusion_policy git submodule on an Ubuntu machine with a single GPU
Prompt 2
Preprocess the RECON dataset HDF5 release into the per-trajectory jpg and pickle layout this repo expects
Prompt 3
Explain how classifier-free guidance is wired between the goal embedding path and the U-Net here
Prompt 4
Write a yaml config that points training at my own ROS bag dataset converted with the generic script
Prompt 5
Run the smoke test and fake data generator to verify my install before kicking off real training

Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.