
open-x-humanoid/hex

14 | Jupyter Notebook | Audience · researcher | Complexity · 5/5 | Active | Setup · hard

TLDR

Vision-language-action (VLA) framework for humanoid robots. A 2.4B-parameter model takes camera input plus a text instruction and outputs whole-body manipulation actions across multiple robot bodies.

Mindmap

mindmap
  root((HEX))
    Inputs
      Camera frames
      Language instruction
      Proprioceptive state
    Outputs
      Arm and hand motions
      Waist motions
      Leg controller commands
    Use Cases
      Humanoid manipulation research
      Cross-embodiment policy training
      VLA fine-tuning
    Tech Stack
      Python
      PyTorch
      QwenVL
      FlashAttention
      CUDA

Things people build with this

USE CASE 1

Run the released HEX 2.4B checkpoint in eval_model.ipynb to test humanoid policies on your own data

USE CASE 2

Fine-tune the VLA model on a new humanoid platform using the cross-embodiment slot scheme

USE CASE 3

Pretrain a custom whole-body manipulation policy on the AgiBot World plus Humanoid Everyday mixture

USE CASE 4

Reproduce paper results on Unitree G1 or Tienkung robots

Tech stack

Python · PyTorch · QwenVL · FlashAttention · CUDA

Getting it running

Difficulty · hard | Time to first run · 1 day+

Needs a CUDA GPU, FlashAttention 2 wheels matched to your card, EGL/Mesa system libraries, and Hugging Face downloads of both the HEX and Qwen3-VL checkpoints.
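A quick preflight check along these lines can catch missing Python-side pieces before the longer setup steps. This is a minimal sketch and not part of the HEX repo: it assumes `torch` and `flash_attn` are the import names for PyTorch and FlashAttention 2, and that Python 3.10 or newer is acceptable. It only checks that the packages are discoverable. It does not verify that the FlashAttention wheel matches your GPU or that the EGL/Mesa libraries are installed.

```python
import importlib.util
import sys


def preflight(modules=("torch", "flash_attn"), min_python=(3, 10)):
    """Report which prerequisites are discoverable, without importing heavy packages.

    The module names and minimum Python version are assumptions for illustration.
    """
    report = {"python": sys.version_info[:2] >= min_python}
    for name in modules:
        # find_spec returns None when a top-level module cannot be located.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for check, ok in preflight().items():
        print(f"{check:12s} {'ok' if ok else 'MISSING'}")
```

A `MISSING` line for `flash_attn` is the cue to revisit the wheel selection before opening the notebook.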

In plain English

HEX is research code from the Open-X-Humanoid project that accompanies a paper titled Humanoid-Aligned Experts for Cross-Embodiment Whole-Body Manipulation. In plain terms, it is a control system for full-sized humanoid robots that takes camera input plus a language instruction and decides how the robot should move. The README describes it as a vision-language-action framework, with a 2.4 billion parameter model released on Hugging Face under the name HEX-model.

The model is built from three parts. There is a Qwen-VL backbone, which is a pretrained vision and language model that reads images and text. There is a unified proprioceptive predictor, which takes the robot's own joint and sensor readings and lines them up across different robot bodies. And there is a flow matching action head, which outputs the next stretch of continuous arm, hand, and waist motions. A separate reinforcement learning controller handles the legs and follows high-level commands from the main policy, which is meant to keep the robot stable while it manipulates objects.

A key claim is cross-embodiment training. The team aligns data from several different humanoid platforms, including the Tienkung series, Unitree G1, Unitree H1, and Leju Kuavo, into shared body-part slots so the policy learns one set of dynamics that transfers across the different machines. The training mixture pulls from their own released dataset and from public sets like Humanoid Everyday, AgiBot World Colosseo with the TrajBooster retargeting, and RoboCOIN, with links to each on Hugging Face.

The install path is conda-based. You clone the repo, create a Python 3.10 environment, apt install some EGL and Mesa system libraries, pip install the requirements, install FlashAttention 2, and then install the package itself with pip install -e. The README includes a fallback recipe for newer GPUs like an RTX 5090, where the prebuilt FlashAttention wheels may not match, and points readers at the official wheels page.

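The shared body-part slot idea described above can be pictured with a small sketch. This is one plausible reading (zero-pad each body part to a fixed width and carry a validity mask), not the HEX implementation: the slot names, slot widths, and both example robots are hypothetical and chosen only for illustration.

```python
# Hypothetical slot layout: names and widths are illustrative, not taken from HEX.
SLOT_LAYOUT = {
    "left_arm": 7,
    "right_arm": 7,
    "left_hand": 6,
    "right_hand": 6,
    "waist": 3,
}


def pack_into_slots(readings, layout=SLOT_LAYOUT):
    """Pack per-body-part joint readings into one fixed-length vector and a mask.

    readings maps a body-part name to a list of joint values. Missing parts and
    unused joints inside a slot are zero-padded and marked 0 in the mask.
    """
    unknown = set(readings) - set(layout)
    if unknown:
        raise KeyError(f"no slot defined for: {sorted(unknown)}")
    vector, mask = [], []
    for part, width in layout.items():
        values = list(readings.get(part, []))
        if len(values) > width:
            raise ValueError(f"{part} has {len(values)} joints, slot holds {width}")
        pad = width - len(values)
        vector.extend(values + [0.0] * pad)
        mask.extend([1] * len(values) + [0] * pad)
    return vector, mask


# Two made-up robots with different joint counts end up with the same vector length.
robot_a = {"left_arm": [0.1] * 7, "right_arm": [0.2] * 7, "waist": [0.0] * 3}
robot_b = {"left_arm": [0.3] * 5, "right_arm": [0.4] * 5, "waist": [0.5]}
vec_a, mask_a = pack_into_slots(robot_a)
vec_b, mask_b = pack_into_slots(robot_b)
```

Because both vectors share one layout, a single network can consume data from either robot, and the mask tells it which entries carry real readings.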
To run inference you download the HEX checkpoint and the Qwen3-VL base model from Hugging Face, point config.yaml at your local Qwen path, and open a Jupyter notebook called eval_model.ipynb that the team ships in the notebooks folder. For pretraining and fine-tuning there are bash scripts under scripts/ where you set the base VLM path, the data root, and a dataset mixture name that has to match the entries listed in the dataloader files.

The team notes that data collection code for the Tienkung robots cannot be released due to commercial restrictions, and points users who want to gather data on Unitree G1 to two outside open source teleoperation projects, OpenTrajBooster and Psi0.
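The requirement that the mixture name match a registered entry can be sketched as a lookup that fails early with the valid names. Everything here is hypothetical: the registry, the mixture names, the dataset names, and the per-dataset weights are placeholders and do not reflect how the HEX dataloader files are actually organized.

```python
# Hypothetical registry standing in for the entries listed in the dataloader files.
MIXTURES = {
    "example_mixture_a": [("dataset_one", 1.0), ("dataset_two", 0.5)],
    "example_mixture_b": [("dataset_one", 1.0)],
}


def resolve_mixture(name, registry=MIXTURES):
    """Return normalized sampling weights for a named mixture.

    Raises KeyError listing the valid names when the name is not registered,
    so a typo in a training script surfaces before any data is loaded.
    """
    if name not in registry:
        valid = ", ".join(sorted(registry))
        raise KeyError(f"unknown mixture {name!r}; expected one of: {valid}")
    entries = registry[name]
    total = sum(weight for _, weight in entries)
    return {dataset: weight / total for dataset, weight in entries}


weights = resolve_mixture("example_mixture_a")
```

Checking the name up front is a cheap guard; the real scripts may report a mismatch differently.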

Copy-paste prompts

Prompt 1
Set up the conda env with Python 3.10, FlashAttention 2, and EGL libs needed to run HEX on a single RTX 5090
Prompt 2
Edit config.yaml to point at my local Qwen3-VL checkpoint and run eval_model.ipynb against a Unitree G1 episode
Prompt 3
Add a new dataset mixture entry under scripts/ and wire it into the dataloader for fine-tuning HEX
Prompt 4
Explain how the unified proprioceptive predictor aligns joint readings across Tienkung, Unitree G1, H1, and Leju Kuavo
Prompt 5
Swap the flow matching action head with a diffusion head and benchmark on Humanoid Everyday

Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.