open-x-humanoid/hex

Analysis updated 2026-06-24

★ 13 | Jupyter Notebook | Audience · researcher | Complexity · 5/5 | Setup · hard

Mindmap

mindmap
  root((HEX))
    Inputs
      Camera frames
      Language instruction
      Proprioceptive state
    Outputs
      Arm and hand motions
      Waist motions
      Leg controller commands
    Use Cases
      Humanoid manipulation research
      Cross-embodiment policy training
      VLA fine-tuning
    Tech Stack
      Python
      PyTorch
      QwenVL
      FlashAttention
      CUDA

What do people build with it?

USE CASE 1

Run the released HEX 2.4B checkpoint in eval_model.ipynb to test humanoid policies on your own data

USE CASE 2

Fine-tune the VLA model on a new humanoid platform using the cross-embodiment slot scheme

USE CASE 3

Pretrain a custom whole-body manipulation policy on the AgiBot World plus Humanoid Everyday mixture

USE CASE 4

Reproduce paper results on Unitree G1 or Tienkung robots

What is it built with?

Python, PyTorch, QwenVL, FlashAttention, CUDA

How does it compare?

	open-x-humanoid/hex	lfrincond/seismic_imaging26	onuralpszr/litert-lm-cookbook
Stars	13	13	13
Language	Jupyter Notebook	Jupyter Notebook	Jupyter Notebook
Setup difficulty	hard	hard	moderate
Complexity	5/5	4/5	3/5
Audience	researcher	researcher	developer

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · hard | Time to first run · 1 day+

Needs a CUDA GPU, FlashAttention 2 wheels matched to your card, EGL/Mesa system libraries, and Hugging Face downloads of both the HEX and Qwen3-VL checkpoints.
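
Before starting a long install, it can help to check which of these prerequisites are already visible on the machine. The sketch below is not part of the HEX repo. It assumes the usual import names `torch` and `flash_attn` and the `nvidia-smi` command-line tool, and it only reports whether each one can be found. It does not verify versions, CUDA compatibility, or the EGL/Mesa libraries.

```python
import importlib.util
import shutil


def preflight(modules=("torch", "flash_attn"), tools=("nvidia-smi",)):
    """Report which prerequisites are discoverable, without importing them."""
    report = {}
    for name in modules:
        # find_spec returns None when a top-level module is not installed.
        report[f"module:{name}"] = importlib.util.find_spec(name) is not None
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        report[f"tool:{tool}"] = shutil.which(tool) is not None
    return report


if __name__ == "__main__":
    for item, found in preflight().items():
        print(f"{'ok     ' if found else 'MISSING'}  {item}")
```

A line marked MISSING only means the item was not found in the current environment, so run the check inside the activated conda environment.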

In plain English

HEX is research code from the Open-X-Humanoid project that accompanies the paper Humanoid-Aligned Experts for Cross-Embodiment Whole-Body Manipulation. It is a control system for full-sized humanoid robots that takes camera input plus a language instruction and decides how the robot should move. The README describes it as a vision-language-action (VLA) framework, with a 2.4-billion-parameter model released on Hugging Face under the name HEX-model.

The model is built from three parts. A Qwen-VL backbone, a pretrained vision and language model, reads the images and text. A unified proprioceptive predictor takes the robot's own joint and sensor readings and lines them up across different robot bodies. A flow matching action head outputs the next stretch of continuous arm, hand, and waist motions. A separate reinforcement learning controller handles the legs and follows high-level commands from the main policy, which is meant to keep the robot stable while it manipulates objects.

A key claim is cross-embodiment training. The team aligns data from several humanoid platforms, including the Tienkung series, Unitree G1, Unitree H1, and Leju Kuavo, into shared body part slots so that the policy learns one set of dynamics that transfers across machines. The training mixture draws on the team's own released dataset and on public sets such as Humanoid Everyday, AgiBot World Colosseo with the TrajBooster retargeting, and RoboCOIN, with links to each on Hugging Face.

The install path is conda-based. You clone the repo, create a Python 3.10 environment, apt install some EGL and Mesa system libraries, pip install the requirements, install FlashAttention 2, and then pip install -e the package itself. The README includes a fallback recipe for newer GPUs such as the RTX 5090, where the prebuilt FlashAttention wheels may not match, and points readers to the official wheels page.
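
The idea of aligning different robots into shared body part slots can be pictured with a small sketch. Everything below is an assumption made for illustration: the slot names, the slot widths, the two made-up robots, and the `to_slots` helper are hypothetical and are not taken from the HEX code. The sketch pads each body part to a fixed width and keeps a mask that marks which entries are real joints, so robots with different joint counts end up with vectors of the same length.

```python
# Hypothetical slot layout: body part -> fixed slot width. Not taken from the HEX repo.
SLOT_WIDTHS = {"left_arm": 7, "right_arm": 7, "waist": 3, "legs": 12}


def to_slots(readings, slot_widths=SLOT_WIDTHS):
    """Pack per-part joint readings into one fixed-length vector plus a validity mask."""
    values, mask = [], []
    for part, width in slot_widths.items():
        joints = list(readings.get(part, []))
        if len(joints) > width:
            raise ValueError(f"{part}: {len(joints)} joints exceed slot width {width}")
        pad = width - len(joints)
        values.extend(joints + [0.0] * pad)         # zero-fill unused slot entries
        mask.extend([1] * len(joints) + [0] * pad)  # 1 = real joint, 0 = padding
    return values, mask


# Two made-up robots with different joint counts per body part.
robot_a = {"left_arm": [0.1] * 7, "right_arm": [0.2] * 7, "waist": [0.0] * 3, "legs": [0.3] * 12}
robot_b = {"left_arm": [0.1] * 5, "right_arm": [0.2] * 5, "waist": [0.0]}

vec_a, mask_a = to_slots(robot_a)
vec_b, mask_b = to_slots(robot_b)
print(len(vec_a), len(vec_b))    # both vectors have the same length
print(sum(mask_a), sum(mask_b))  # number of real joints per robot
```

The mask lets a downstream model or loss ignore padded entries. How HEX actually handles missing joints should be checked against the repo and paper.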
To run inference, you download the HEX checkpoint and the Qwen3-VL base model from Hugging Face, point config.yaml at your local Qwen path, and open eval_model.ipynb, a Jupyter notebook that the team ships in the notebooks folder. For pretraining and fine-tuning there are bash scripts under scripts/ where you set the base VLM path, the data root, and a dataset mixture name, which has to match one of the entries listed in the dataloader files.

The team notes that the data collection code for the Tienkung robots cannot be released because of commercial restrictions. Users who want to gather data on a Unitree G1 are pointed to two outside open-source teleoperation projects, OpenTrajBooster and Psi0.
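
As a rough picture of what a flow matching action head does at inference time, the sketch below integrates a velocity field from random noise at t=0 to an action vector at t=1 using fixed Euler steps. This is the generic technique, not HEX's implementation. In HEX the velocity would come from a learned network conditioned on the backbone's features, whereas here `toy_velocity` is a hand-written stand-in that points straight at a fixed target, and the names, step count, and action dimension are all assumptions.

```python
import random


def sample_actions(velocity_fn, dim, steps=10, seed=0):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 (noise) to t=1 with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt  # t stays below 1, so the toy field never divides by zero
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy stand-in for a learned network: the straight-line velocity toward a fixed target.
TARGET = [0.5, -0.2, 0.1]


def toy_velocity(x, t):
    return [(target - xi) / (1.0 - t) for xi, target in zip(x, TARGET)]


actions = sample_actions(toy_velocity, dim=len(TARGET))
print([round(a, 6) for a in actions])
```

With this toy field the samples land on the target regardless of the starting noise, which makes the integration loop easy to check. A trained model would instead map noise to a distribution of plausible action chunks.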

Copy-paste prompts

Prompt 1

Set up the conda env with Python 3.10, FlashAttention 2, and EGL libs needed to run HEX on a single RTX 5090

Prompt 2

Edit config.yaml to point at my local Qwen3-VL checkpoint and run eval_model.ipynb against a Unitree G1 episode

Prompt 3

Add a new dataset mixture entry under scripts/ and wire it into the dataloader for fine-tuning HEX

Prompt 4

Explain how the unified proprioceptive predictor aligns joint readings across Tienkung, Unitree G1, H1, and Leju Kuavo

Prompt 5

Swap the flow matching action head with a diffusion head and benchmark on Humanoid Everyday

Frequently asked questions

What is hex?

A vision-language-action framework for humanoid robots. A 2.4B-parameter model takes camera input plus a text instruction and outputs whole-body manipulation actions across multiple robot bodies.

What language is hex written in?

GitHub lists the repo as mainly Jupyter Notebook. The stack also includes Python, PyTorch, and QwenVL.

How hard is hex to set up?

Setup difficulty is rated hard, with roughly a day or more to a first successful run.

Who is hex for?

Mainly researchers.

Verify against the repo before relying on details.