explaingit

dexoravla/dexora

Analysis updated 2026-06-24

Stars · 81 | Language · Python | Audience · researcher | Complexity · 5/5 | License · MIT | Setup · hard

TLDR

Vision-Language-Action research codebase for a 36-DOF bimanual robot, with teleoperation, a Hugging Face dataset, a three-stage Diffusion Transformer training pipeline, and ZMQ inference.

Mindmap

mindmap
  root((Dexora))
    Inputs
      Exoskeleton motion
      Apple Vision Pro hand tracking
      Camera images
    Outputs
      36-joint actions
      Trained policy
      MuJoCo rollouts
    Use Cases
      Train a VLA policy
      Replicate ICRA 2026 paper
      Drive a real bimanual robot
    Tech Stack
      PyTorch
      MuJoCo
      SigLIP
      T5
      ZMQ

What do people build with it?

USE CASE 1

Reproduce the ICRA 2026 Dexora training pipeline on the public Hugging Face dataset.

USE CASE 2

Collect new teleoperation demonstrations with an exoskeleton and Apple Vision Pro hand tracking.

USE CASE 3

Fine-tune a Diffusion Transformer policy with a quality-weighted loss using a learned discriminator.

USE CASE 4

Run the three-process ZMQ inference stack on an AIRBOT plus XHAND bimanual robot.
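For use case 4, the three processes have to agree on how a 36-joint action is encoded before it crosses a process boundary. The page does not describe the repo's actual wire format, so the following is only a minimal sketch under an assumed encoding: 36 little-endian float32 values in a fixed order. The names `pack_action` and `unpack_action` are hypothetical and do not come from the Dexora code.

```python
import struct
from typing import List, Sequence

NUM_JOINTS = 36  # from the page: one action is 36 joint angles

# Assumption for illustration: little-endian float32, fixed joint order.
_ACTION_FORMAT = f"<{NUM_JOINTS}f"
ACTION_NBYTES = struct.calcsize(_ACTION_FORMAT)  # 36 * 4 bytes


def pack_action(joints: Sequence[float]) -> bytes:
    """Serialize one 36-joint action into a fixed-size byte payload."""
    if len(joints) != NUM_JOINTS:
        raise ValueError(f"expected {NUM_JOINTS} joints, got {len(joints)}")
    return struct.pack(_ACTION_FORMAT, *joints)


def unpack_action(payload: bytes) -> List[float]:
    """Inverse of pack_action; rejects payloads of the wrong size."""
    if len(payload) != ACTION_NBYTES:
        raise ValueError(f"expected {ACTION_NBYTES} bytes, got {len(payload)}")
    return list(struct.unpack(_ACTION_FORMAT, payload))
```

With pyzmq, a payload like this could be passed to `socket.send(payload)` in the policy process and read back with `socket.recv()` in an SDK process. A fixed-size binary layout keeps the receiving side free of any dependency on the sender's Python environment, which matters when the environments conflict.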

What is it built with?

PyTorch, MuJoCo, SigLIP, T5, ZMQ

How does it compare?

Repo              dexoravla/dexora   clawdbrunner/captcha-solver   aqua5230/usage
Stars             81                 81                            82
Language          Python             Python                        Python
Setup difficulty  hard               moderate                      easy
Complexity        5/5                3/5                           2/5
Audience          researcher         developer                     developer

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · hard | Time to first run · 1 day+

Needs about 48 GB of disk for the SigLIP and T5-v1.1-XXL encoders, pinned library versions, and three separate Python environments for the GPU policy and the two robot SDKs.

MIT license: very permissive, free to use with attribution.

In plain English

Dexora is the source code release for a research robot system that the authors call a Vision-Language-Action model, or VLA. The system is designed to drive a robot with two arms and two hands that together have 36 degrees of freedom, meaning 36 independent joint angles the policy has to control. The repository accompanies a paper accepted at ICRA 2026, a major robotics conference, and includes the full training, inference, data processing, and teleoperation code.

The project is built around four main pieces. The first is a way of collecting human demonstrations: an operator wears an exoskeleton backpack that captures broad arm motion, while an Apple Vision Pro headset tracks finger movement without markers. These signals drive both the real robot and a copy of it simulated in MuJoCo, a physics engine.

The second piece is a dataset of those demonstrations, hosted on Hugging Face. The README lists 12.2 thousand real-world episodes covering about 40 hours of teleoperation, plus 100 thousand simulated trajectories using the same 36-joint body layout.

The third piece is the training pipeline. It runs in three stages: pretrain a Diffusion Transformer policy, train a separate discriminator that scores how good each demonstration clip is, then fine-tune the policy with a weighted loss that pays less attention to low-quality demonstrations. Shell scripts launch each stage and read paths from environment variables. The setup relies on two large pretrained encoders, SigLIP for vision and T5-v1.1-XXL for language, which together take roughly 48 GB of disk.

The fourth piece is the inference stack that runs on the actual robot. It splits work across three Python processes that talk over ZMQ, because the GPU policy, the AIRBOT arm SDK, and the XHAND hand SDK each need conflicting Python environments and cannot share one process. There is also a CPU-only test suite of 57 tests for quick checks.

The code is released under an MIT license, and the README pins specific versions of PyTorch, transformers, diffusers, accelerate, and LeRobot because newer versions break the interfaces the training stack expects.
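To make the third training stage more concrete, here is one way a quality-weighted loss could be computed from discriminator scores. This is a sketch under stated assumptions, not the repo's implementation: it assumes scores lie in [0, 1], clamps them to a hypothetical `floor` so weak clips are down-weighted rather than dropped, and normalizes the weights to mean 1 so the loss scale stays comparable to an unweighted average. The function names are illustrative, and plain Python lists stand in for tensors.

```python
from typing import List, Sequence


def quality_weights(scores: Sequence[float], floor: float = 0.1) -> List[float]:
    """Turn per-clip quality scores into loss weights with mean 1.

    `floor` is an illustrative choice: it keeps every clip's weight above
    zero so low-quality demonstrations still contribute a little.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    clamped = [max(floor, min(1.0, s)) for s in scores]
    mean = sum(clamped) / len(clamped)
    return [c / mean for c in clamped]


def weighted_loss(
    per_sample_losses: Sequence[float],
    scores: Sequence[float],
    floor: float = 0.1,
) -> float:
    """Average per-sample losses, weighting each by its clip quality."""
    if len(per_sample_losses) != len(scores):
        raise ValueError("losses and scores must have the same length")
    weights = quality_weights(scores, floor)
    total = sum(w * loss for w, loss in zip(weights, per_sample_losses))
    return total / len(weights)
```

For example, with per-sample losses `[2.0, 4.0]` and scores `[1.0, 0.5]`, the second clip's loss counts half as much as the first, so the result falls below the unweighted mean of 3.0. When all scores are equal, the weights are all 1 and the function reduces to a plain average.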

Copy-paste prompts

Prompt 1
Set up Dexora with the pinned PyTorch, transformers, diffusers, accelerate, and LeRobot versions, then run the 57 CPU-only tests.
Prompt 2
Run the Dexora stage-one Diffusion Transformer pretraining script against the Hugging Face dataset and explain each environment variable.
Prompt 3
Run the Dexora discriminator training stage and then the weighted fine-tune, showing how the discriminator scores feed into the loss.
Prompt 4
Walk through the three-process ZMQ inference layout in Dexora and explain why the GPU policy, AIRBOT SDK, and XHAND SDK need separate Python envs.
Prompt 5
Adapt the Dexora MuJoCo simulator to a different bimanual robot body while keeping the 36-joint convention.

Frequently asked questions

What is dexora?

Vision-Language-Action research codebase for a 36-DOF bimanual robot, with teleoperation, a Hugging Face dataset, a three-stage Diffusion Transformer training pipeline, and ZMQ inference.

What language is dexora written in?

Mainly Python. The stack also includes PyTorch, MuJoCo, SigLIP, T5, and ZMQ.

What license does dexora use?

MIT license: very permissive, free to use with attribution.

How hard is dexora to set up?

Setup difficulty is rated hard, with roughly a day or more to a first successful run.

Who is dexora for?

Mainly researchers.

Verify against the repo before relying on details.