amoghshrivastava/vlajepa

Analysis updated 2026-05-18

★ 2 | Python | Audience · researcher | Complexity · 5/5 | Setup · hard

Mindmap

mindmap
  root((vlaJEPA))
    What it does
      Fine-tunes VLA-JEPA
      SO-101 arm support
      Pick-and-place training
    Model components
      Qwen3-VL-2B backbone
      V-JEPA2 vision encoder
      DiT-B action head
    Scripts provided
      Environment setup
      Weight download
      Dataset conversion
      Smoke test
      Full training run
    Requirements
      H100 or L40S GPU
      Conda
      Hugging Face datasets


What do people build with it?

USE CASE 1

Fine-tune VLA-JEPA on the SO-101 pick-and-place dataset to train a robot arm on grasping tasks.

USE CASE 2

Study how to adapt an existing VLA model to a new robot embodiment by adding joint-space data configuration.

USE CASE 3

Run a 50-step smoke test to validate a VLA-JEPA training environment before committing to a full training run.
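The idea behind a short smoke test can be sketched in a few lines of Python. This is only an illustration of the pattern, not the repository's script: `smoke_test` and `fake_train_step` are hypothetical names, and the real step would run a forward and backward pass of the model rather than return a made-up number.

```python
import math
from typing import Callable


def smoke_test(train_step: Callable[[int], float], num_steps: int = 50) -> list[float]:
    """Run a small, fixed number of steps and fail fast on a non-finite loss."""
    losses = []
    for step in range(num_steps):
        loss = float(train_step(step))
        if not math.isfinite(loss):
            raise RuntimeError(f"non-finite loss {loss} at step {step}")
        losses.append(loss)
    return losses


# Hypothetical stand-in for a real training step: a loss that decays with the step.
def fake_train_step(step: int) -> float:
    return 1.0 / (step + 1)


losses = smoke_test(fake_train_step)
print(f"ran {len(losses)} steps, final loss {losses[-1]:.4f}")
```

Stopping at the first NaN or infinite loss keeps a misconfigured environment from consuming GPU hours before the problem is noticed.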

What is it built with?

Python, PyTorch, Conda, Hugging Face, DeepSpeed

How does it compare?

	amoghshrivastava/vlajepa	0-bingwu-0/live-interpreter	0xkaz/llm-governance-dashboard
Stars	2	2	2
Language	Python	Python	Python
Setup difficulty	hard	moderate	hard
Complexity	5/5	2/5	4/5
Audience	researcher	general	ops devops

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · hard | Time to first run · 1 day+

Requires an H100 or L40S GPU. No consumer GPU or CPU fallback is mentioned.
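A quick pre-flight check for the GPU requirement might look like the sketch below. It is not part of the repository: the helper names are hypothetical, and matching on the substrings "H100" and "L40S" is an assumption about how the driver reports device names.

```python
import shutil
import subprocess

# GPU classes named in the requirement above; substring matching is an assumption.
SUPPORTED_GPUS = ("H100", "L40S")


def is_supported_gpu(name: str) -> bool:
    """Return True if the reported device name mentions a supported GPU class."""
    return any(tag in name.upper() for tag in SUPPORTED_GPUS)


def detect_gpu_names() -> list[str]:
    """Return GPU names reported by nvidia-smi, or [] if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


names = detect_gpu_names()
if not names:
    print("no NVIDIA GPU detected")
for name in names:
    status = "ok" if is_supported_gpu(name) else "not listed as supported"
    print(f"{name}: {status}")
```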

No license is stated in the README.

In plain English

VLA-JEPA is a research model architecture for controlling physical robots. It combines a language-capable vision model, a visual prediction encoder, and an action generation head. This repository is a reproduction pipeline that makes the original VLA-JEPA codebase work with a specific robot arm, the SO-101, using a standard pick-and-place dataset.

The original VLA-JEPA research code supports several well-known robot datasets but not the SO-101, which uses joint-space control. This repository adds the configuration files, dataset adapters, and bug fixes needed to fill that gap, then provides shell scripts to automate the entire setup process.

The pipeline runs in sequence. A setup script creates a Python environment and applies patches to the upstream code. You then download the two pretrained model weights the system builds on (a 2-billion-parameter vision-language model and a large vision encoder) and the SO-101 pick-and-place dataset from Hugging Face. If the downloaded dataset is in the newer v3 format, a conversion script reformats it into the older per-episode layout the training code expects. A smoke test script runs 50 training steps to confirm everything works before a full run.

The system is sized to run on a single H100 or L40S class GPU. No consumer GPU path is mentioned. A separate document in the repository (claudePRD.md) covers the build rationale, cost planning, known issues, and troubleshooting notes. The repository implements the approach from the paper arXiv:2602.10098. No license is stated in the README.
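The conversion from a packed multi-episode layout to a per-episode layout, mentioned above, comes down to grouping frames by episode. The sketch below shows that grouping only. It is not the repository's conversion script: the `episode_index` and `frame_index` field names, the sample rows, and the `episode_000000`-style labels are all assumptions made for illustration.

```python
from collections import defaultdict


def split_by_episode(frames, key="episode_index"):
    """Group a flat list of frame records into one list per episode."""
    episodes = defaultdict(list)
    for frame in frames:
        episodes[frame[key]].append(frame)
    # Return episodes in ascending index order for stable output.
    return {idx: episodes[idx] for idx in sorted(episodes)}


# Hypothetical packed data: frames from several episodes stored together.
packed = [
    {"episode_index": 0, "frame_index": 0, "action": [0.0, 0.1]},
    {"episode_index": 0, "frame_index": 1, "action": [0.1, 0.2]},
    {"episode_index": 1, "frame_index": 0, "action": [0.5, 0.4]},
]

per_episode = split_by_episode(packed)
for idx, rows in per_episode.items():
    # Illustrative label only; the real per-episode file naming may differ.
    print(f"episode_{idx:06d}: {len(rows)} frames")
```

In a real conversion each group would be written to its own file, along with any metadata the training code expects.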

Copy-paste prompts

Prompt 1

I'm setting up vlaJEPA on an H100 and the smoke test fails with a DeepSpeed single-process launch error. What should I check in the patched launch configuration?

Prompt 2

Help me write a new DataConfig for a robot arm other than SO-101 to plug into the VLA-JEPA training pipeline in this repo.

Prompt 3

The convert_v3_to_v2.py script converts LeRobot dataset formats. Explain the structural difference between the v3 packed multi-episode layout and the v2 per-episode layout.

Prompt 4

I finished a training run with vlaJEPA. How do I use inference_check.py to load my checkpoint and interpret the predicted action vs ground truth comparison?

Frequently asked questions

What is vlajepa?

A fine-tuning pipeline that adapts the VLA-JEPA robot-control model to work with the SO-101 arm using a pick-and-place dataset, sized for a single H100-class GPU.

What language is vlajepa written in?

Mainly Python. The stack also includes PyTorch, Conda, Hugging Face, and DeepSpeed.

What license does vlajepa use?

No license is stated in the README.

How hard is vlajepa to set up?

Setup difficulty is rated hard, with a day or more to a first successful run.

Who is vlajepa for?

Mainly researchers.


Verify against the repo before relying on details.